Quantitative assessments of different multifocus image fusion methods on different source images.

## Abstract

Image fusion combines information from multiple images into a single fused image. Although a large number of methods have been proposed, obtaining clearer, higher-quality fused images remains challenging. This chapter addresses the multifocus image fusion problem, that is, extending the depth of field by fusing several images of the same scene taken with different focus settings. Existing research in multifocus image fusion tends to emphasize pixel-level fusion using transform domain methods, whereas region-level methods, especially those using new coding techniques, are still limited. In this chapter, we provide an overview of regional multifocus image fusion and adopt two orthogonal matching pursuit-based sparse representation methods for it. Experimental results show that regional image fusion using sparse representation achieves comparable or even better performance on multifocus image fusion problems.

### Keywords

- image fusion
- multifocus
- region
- image segmentation
- sparse representation

## 1. Introduction

The depth of field is usually limited in current imaging systems that use conventional sensors such as CCD cameras. Hence, the captured image is usually only partly in focus, and only the in-focus objects appear sharp and clear. However, accurate image analysis requires all objects to be in focus [1]. Multifocus image fusion is an effective approach to extending the depth of field: it combines several images of the same scene taken with different focus settings and provides a better view for human perception. The function of multifocus image fusion is illustrated in Figure 1. The white boxes in the source images indicate the regions in focus.

In recent years, multifocus image fusion has been widely used in application fields such as biochemical analysis [2], medical image processing [3], remote sensing [4], and other areas [5]. Many novel multifocus image fusion methods have been proposed; they can be categorized into pixel-level and region-level fusion methods. In pixel-level image fusion, the fusion decision is usually made from pixel-wise features of the source images. Pixel-level fusion has several advantages: it retains the full original information of the source images and is easy to implement. However, it is sensitive to noise, which can cause pixels to be selected from the wrong source image. Multiscale transform-based pixel-level methods have recently become popular because they preserve more of the sharpness and edge information in the source images. The benefits of different transforms, such as the discrete wavelet [6], curvelet [7], and contourlet [8] transforms, have been well explored. However, because coefficients are selected at the pixel level with little consideration of spatial information, artifacts may appear in the fused images.

To address the weaknesses mentioned above and exploit more of the spatial information in images, the fusion can be performed region by region instead of pixel by pixel. A few regional multifocus image fusion methods have been proposed. For instance, Omar et al. have proposed a region-based image fusion method using a combinatory Chebyshev-ICA method [9], and Li et al. have proposed a regional image fusion method using spatial frequency [10]. Regional multifocus image fusion usually consists of two steps: image segmentation (partition) and the fusion (merging) of different regions. The general process of region-level multifocus image fusion is illustrated in Figure 2. Each source image is initially partitioned in some way to produce a set of regions. Various properties of these regions are then calculated and used to determine which regions from which source images are to be included in the fused image. This has advantages over pixel-level methods, because more semantic fusion rules can be defined on the regional features of the image. Finally, the selected in-focus regions are combined into the fused image.

Regional multifocus image fusion is the main subject of this chapter. In Section 2, we first introduce how to represent image patches using two sparse representation algorithms: the orthogonal matching pursuit (OMP) algorithm and the simultaneous orthogonal matching pursuit (SOMP) algorithm. We also describe how to calculate the focus measure from the obtained sparse coefficients. Two regional multifocus image fusion schemes based on these algorithms, together with the corresponding fusion processes, are then given in Section 3. In Section 4, we conduct experiments on source image pairs with different depths of field, compare the new methods with several state-of-the-art methods, and report the fusion results. Finally, the conclusion and future work are given in Section 5.

## 2. Sparse representation theory and clarity measure

As shown in Figure 2, one important module in regional multifocus image fusion is the focus measure (which is also important in pixel-by-pixel fusion). Many image coding schemes, such as wavelets and EMD, have been used for this purpose [11, 12]. Here we use a more recently proposed coding scheme, namely sparse representation.

As an extension of the wavelet transform (WT), sparse representation has become a popular tool in image and signal processing tasks such as compressed sensing [13], image de-noising [14], image classification [15], and face recognition [16]. In sparse representation, the patches of an image are represented as linear combinations of a “few” atoms from an overcomplete dictionary [17]. Figure 3 shows the sparse linear model. In this model, an image patch *y* can be expressed by a sparse vector **α** as

$$y = D\boldsymbol{\alpha} \tag{1}$$

where the overcomplete dictionary is *D* = {*d*_{1}, *d*_{2}, …, *d*_{N}} and the number of atoms is *N*. There are many methods to generate the overcomplete dictionary, such as K-SVD [18] and discrete cosine transforms (DCTs), and a dictionary can also be learned directly from example images [19]. Given the dictionary *D*, the image signal can be represented by the sparse coefficient **P** = {**p**_{1}, **p**_{2}, …, **p**_{N}}.
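As a concrete illustration of the sparse linear model, the sketch below builds an overcomplete DCT dictionary and synthesizes a patch from three of its atoms. The 8 × 8 patch size, the 256-atom dictionary, the chosen atom indices, and the function name `overcomplete_dct` are illustrative assumptions, not settings taken from this chapter.

```python
import numpy as np

def overcomplete_dct(patch_size=8, atoms_1d=16):
    """Return a (patch_size**2, atoms_1d**2) overcomplete DCT dictionary.

    The 2-D atoms are Kronecker products of unit-norm 1-D cosine atoms,
    so every column of the result also has unit norm.
    """
    D1 = np.zeros((patch_size, atoms_1d))
    n = np.arange(patch_size)
    for k in range(atoms_1d):
        atom = np.cos(n * k * np.pi / atoms_1d)
        if k > 0:
            atom = atom - atom.mean()  # remove the DC part of non-constant atoms
        D1[:, k] = atom / np.linalg.norm(atom)
    return np.kron(D1, D1)

D = overcomplete_dct()               # 64 x 256: more atoms than pixels
alpha = np.zeros(D.shape[1])         # sparse coefficient vector
alpha[[3, 40, 200]] = [1.5, -0.7, 0.2]
y = D @ alpha                        # vectorized patch built from three atoms
```

Here `y` has 64 entries but is fully described by three non-zero coefficients, which is the sense in which the representation is sparse.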

The number of non-zero entries in the coefficient vector *P* is ‖*P*‖_{0}. According to sparse representation theory, the smaller ‖*P*‖_{0} is, the sparser the representation of the image patch. Therefore, we minimize ‖*P*‖_{0}, which can be formulated as follows:

$$\min_{P} \|P\|_{0} \quad \text{subject to} \quad \|y - DP\|_{2} \le \varepsilon \tag{2}$$

where *ε* is the global error tolerance. The optimization problem in Eq. (2) can be solved approximately by greedily testing combinations of the columns of *D* [20]. Among such greedy algorithms, orthogonal matching pursuit (OMP) is widely used; we refer readers to Ref. [21] for details of the OMP algorithm.
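The greedy idea behind OMP can be sketched in a few lines of NumPy. This is a minimal textbook-style version, not the implementation used in the experiments: the function name `omp`, the stopping parameters, and the tiny five-atom dictionary are assumptions made for illustration, and the dictionary columns are assumed to have unit norm.

```python
import numpy as np

def omp(D, y, eps=1e-6, max_atoms=None):
    """Greedy approximation of: min ||p||_0  subject to  ||y - D p||_2 <= eps."""
    y = np.asarray(y, dtype=float)
    n_atoms = D.shape[1]
    if max_atoms is None:
        max_atoms = min(D.shape)
    support = []                      # indices of the atoms selected so far
    coef = np.zeros(0)
    residual = y.copy()
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:              # no further progress is possible
            break
        support.append(k)
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    p = np.zeros(n_atoms)
    if support:
        p[support] = coef
    return p

# Toy dictionary: four unit vectors plus one constant unit-norm atom.
D = np.column_stack([np.eye(4), np.full(4, 0.5)])
y = np.array([2.0, 0.0, -3.0, 0.0])
p = omp(D, y)   # selects atoms 2 and 0 and reproduces y exactly
```

The least-squares re-fit after each selection is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to every atom already chosen.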

Unlike the OMP algorithm, which processes each signal (image patch) separately, we can fix the dictionary *D* and derive the sparse coefficients of several signals simultaneously by solving the following optimization problem:

$$\min_{P} \|P\|_{\text{row-}0} \quad \text{subject to} \quad \|Y - DP\|_{F} \le \varepsilon \tag{3}$$

where *P* contains the sparse coefficients for a set of signals (image patches *Y*), ‖*P*‖_{row-0} counts the non-zero rows of *P*, and *ε* is the error tolerance. The assumption that several image patches are sparsely represented by the dictionary simultaneously is valuable for the multifocus image fusion problem, because the image patches at the same location in different source images are regarded as observations of the same objects. The optimization problem in Eq. (3) can be solved by another greedy algorithm called simultaneous orthogonal matching pursuit (SOMP); its details can be found in Ref. [22].
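A minimal sketch of the simultaneous variant is given below. It differs from single-signal OMP only in the selection step, where the correlations with the residuals of all signals are summed so that every signal shares the same atoms. The function name `somp`, the sum-of-absolute-correlations selection rule, and the toy data are assumptions for illustration; Ref. [22] should be consulted for the actual algorithm.

```python
import numpy as np

def somp(D, Y, eps=1e-6, max_atoms=None):
    """Greedy joint sparse coding: the columns of Y share one set of atoms."""
    Y = np.asarray(Y, dtype=float)
    n_atoms = D.shape[1]
    if max_atoms is None:
        max_atoms = min(D.shape)
    support = []
    coef = np.zeros((0, Y.shape[1]))
    residual = Y.copy()
    # For a 2-D array, np.linalg.norm defaults to the Frobenius norm.
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        # Score each atom by its total absolute correlation over all signals.
        scores = np.abs(D.T @ residual).sum(axis=1)
        k = int(np.argmax(scores))
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], Y, rcond=None)
        residual = Y - D[:, support] @ coef
    P = np.zeros((n_atoms, Y.shape[1]))
    if support:
        P[support, :] = coef
    return P

D = np.column_stack([np.eye(4), np.full(4, 0.5)])
# Two co-located "patches" (columns) that use the same two atoms.
Y = np.array([[2.0, 1.0], [0.0, 0.0], [-3.0, 1.0], [0.0, 0.0]])
P = somp(D, Y)
shared_rows = np.flatnonzero(np.abs(P).sum(axis=1) > 1e-12)  # atoms 0 and 2
```

In the fusion setting, each column of `Y` would hold the patch taken at the same location from one source image.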

In multifocus image fusion, a clarity measure is needed to decide whether an image pixel or region is in focus. Therefore, whichever sparse representation algorithm is used to obtain the sparse coefficients, defining the clarity measure from the derived coefficients is an important step.

Given the sparse coefficient vector *P*_{i} for image patch *i*, we can quantify the information embedded in it by summing the absolute values of its elements. More specifically, the information level of the patch is *F*_{i} = ‖*P*_{i}‖_{1}, where ‖·‖_{1} is the Manhattan norm. Because an out-of-focus patch is smoother than an in-focus patch, it contains less detail and hence has a lower information level. The information level defined by the Manhattan norm can therefore serve as an indicator, or clarity measure, of whether the patch is in focus [23].

Using a sliding window, each source image is divided into a series of image patches, and each patch is reshaped into a vector that can be represented by sparse coefficients. Assuming that the overcomplete dictionary contains *T* atoms and each source image is divided into *r* patches, we can get all the vectorized patches as follows:

where *P* is the sparse coefficient matrix.
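The sliding-window step can be sketched as follows. The function name `image_to_patches`, the row-major patch ordering, and the default patch size and step are assumptions chosen for illustration.

```python
import numpy as np

def image_to_patches(image, patch_size=8, step=1):
    """Slide a window over a 2-D image and return one vectorized patch per column."""
    H, W = image.shape
    columns = []
    for i in range(0, H - patch_size + 1, step):
        for j in range(0, W - patch_size + 1, step):
            patch = image[i:i + patch_size, j:j + patch_size]
            columns.append(patch.reshape(-1))   # row-major vectorization
    return np.stack(columns, axis=1)

image = np.arange(16.0).reshape(4, 4)
V = image_to_patches(image, patch_size=2, step=1)   # 3 x 3 = 9 patches of 4 pixels
```

Each column of `V` can then be coded against the dictionary, giving one column of the coefficient matrix per patch.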

In the following, we calculate the Manhattan norm of each coefficient vector as the clarity measure. This metric is also applied in Ref. [24], and we call the resulting values the activity levels or clarity levels of the corresponding patches:

$$F_{i} = \|P_{i}\|_{1}, \quad i = 1, 2, \ldots, r \tag{6}$$

where ‖*P*_{i}‖_{1} is the Manhattan norm.
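In code, the clarity level of every patch is a column-wise sum of absolute values over the coefficient matrix. The function name `clarity_levels` and the small example matrix are illustrative assumptions.

```python
import numpy as np

def clarity_levels(P):
    """Manhattan (L1) norm of each column of the sparse coefficient matrix P."""
    return np.abs(np.asarray(P, dtype=float)).sum(axis=0)

# Three atoms (rows), two patches (columns).
P = np.array([[1.0, -2.0],
              [0.0,  3.0],
              [-4.0, 0.0]])
F = clarity_levels(P)   # one clarity level per patch
```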

## 3. Regional image fusion using sparse representation

In traditional sparse representation-based image fusion methods, the spatial information of the source images receives little consideration, because only the sum of the absolute sparse coefficients is calculated as the activity level of an image patch and the choose-max fusion rule is applied pixel by pixel. This may lead to distortions such as ringing effects in the fused image. In contrast, in the regional image fusion approach, the source images are first partitioned by an image segmentation method, and the sharp regions, identified by a sharpness measure, are then used to construct the fused image. Many image segmentation algorithms have been proposed, such as normalized cuts [25], watershed-based segmentation [26], and others [27]. Currently, most image segmentation algorithms are quite complicated and time consuming.

In traditional regional multifocus image fusion, the effect of focus on the source images receives little consideration during segmentation. This increases the risk of poor segmentation in some images, because the features of in-focus and out-of-focus pixels are sometimes very similar. With region-by-region selection, if in-focus and out-of-focus pixels are segmented into the same region, the traditional approach inevitably copies some out-of-focus pixels into the final fused image, which lowers the clarity of the fusion result. To alleviate this weakness, the following subsections present two new approaches that take clarity information into account in the image segmentation step.

### 3.1. Regional multifocus image fusion using OMP algorithm

The first new regional multifocus image fusion approach is shown in Figure 4. This method is also viable for the case of fusing more than two source images, but here we just use two source images for simplicity. This approach includes a new operation before the image segmentation step. That is, after obtaining the clarity information from sparse coefficients, we produce a new clarity enhanced image for further image segmentation by linearly combining the average of source images and the clarity information. We detail the three stages of the first proposed approach in the following subsections.

#### 3.1.1. Clarity measure based on sparse representation

To obtain the sparse coefficients *P*_{X} and *P*_{Y}, the OMP algorithm is adopted in the first stage of the fusion process [24]. Next, by Eq. (6), the clarity levels *F*_{X} and *F*_{Y} of the source image patches are calculated from *P*_{X} and *P*_{Y}. We then build the clarity level images *X*_{P} and *Y*_{P}, in which the clarity level of each pixel is obtained by averaging the clarity levels of all the patches that cover it. Finally, the relative clarity level images *X*′_{P} and *Y*′_{P} can be obtained as follows:
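The per-pixel averaging that produces the clarity level images *X*_{P} and *Y*_{P} (not the subsequent relative normalization) can be sketched as below. The function name `clarity_map`, the row-major patch ordering, and the default patch size and step are assumptions; they only need to match the way the patches were extracted.

```python
import numpy as np

def clarity_map(levels, image_shape, patch_size=8, step=1):
    """Spread patch clarity levels over pixels and average where patches overlap."""
    H, W = image_shape
    total = np.zeros((H, W))   # sum of clarity levels of patches covering a pixel
    count = np.zeros((H, W))   # number of patches covering a pixel
    idx = 0
    for i in range(0, H - patch_size + 1, step):
        for j in range(0, W - patch_size + 1, step):
            total[i:i + patch_size, j:j + patch_size] += levels[idx]
            count[i:i + patch_size, j:j + patch_size] += 1
            idx += 1
    return total / np.maximum(count, 1)   # guard against uncovered pixels

# A 3 x 3 image covered by four overlapping 2 x 2 patches.
levels = np.array([1.0, 2.0, 3.0, 4.0])
X_P = clarity_map(levels, (3, 3), patch_size=2, step=1)
```

The center pixel is covered by all four patches and receives their mean, whereas each corner pixel is covered by a single patch and simply inherits its level.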

#### 3.1.2. Segmentation based on clarity enhanced image

The clarity enhanced image is constructed from the relative clarity measures *X*′_{P} and *Y*′_{P} and the source images. Here, we normalize the source images into the interval [0, 1] and denote them as *X*′ and *Y*′. The clarity enhanced image *ZZ* is obtained by

where β adjusts the relative contributions of the relative clarity measure and the original information of the source images. The clarity enhanced image *ZZ* is then segmented into regions by the normalized cut algorithm.

In traditional regional multifocus image fusion, the image to be segmented is usually generated by simply averaging the source images. Here, the focus information is added to the source images and thus acts as a feature in the segmentation process. As a result, a segment is less likely to contain both in-focus and out-of-focus pixels, and the risk of incorrect segmentation decreases.

#### 3.1.3. Regional image fusion

After the image segmentation stage, we obtain the partition of *ZZ*. We then divide the normalized images *X*′ and *Y*′ into homogeneous regions according to the segmentation of *ZZ*. For each pair of corresponding regions in *X*′ and *Y*′, we calculate the mean clarity level of each region, compare the two means, and use the choose-max-mean rule to select the region in focus. With the selected regions, the fused sparse coefficient matrix *P*_{F} is assembled from the corresponding column vectors of *P*_{X} and *P*_{Y}. According to Eq. (12), the vectors of the image patches in the fused image can be calculated as follows:

$$V_{F} = D P_{F} \tag{12}$$

where *P*_{F} is the fused sparse coefficient matrix and *D* is the overcomplete dictionary.

Finally, each vector in *V*_{F} is reshaped into a patch, and all the patches are placed in the fused image at their corresponding positions in the source images. Because the patches overlap, the final fused image is obtained by averaging all the recovered patches.
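The choose-max-mean rule and the final patch averaging can be sketched as follows. The function names, the boolean-mask output, and the tie-breaking in favor of the first source are assumptions for illustration; how region decisions are mapped onto the columns of the fused coefficient matrix is not covered by this sketch.

```python
import numpy as np

def choose_max_mean(labels, clarity_x, clarity_y):
    """Return a boolean mask that is True where source X is selected.

    For every segment in `labels`, the mean clarity of the two sources is
    compared and the whole segment is taken from the sharper source.
    """
    take_x = np.zeros(labels.shape, dtype=bool)
    for region in np.unique(labels):
        mask = labels == region
        take_x[mask] = clarity_x[mask].mean() >= clarity_y[mask].mean()
    return take_x

def patches_to_image(V, image_shape, patch_size=8, step=1):
    """Place vectorized patches back at their positions and average the overlaps."""
    H, W = image_shape
    total = np.zeros((H, W))
    count = np.zeros((H, W))
    idx = 0
    for i in range(0, H - patch_size + 1, step):
        for j in range(0, W - patch_size + 1, step):
            total[i:i + patch_size, j:j + patch_size] += V[:, idx].reshape(patch_size, patch_size)
            count[i:i + patch_size, j:j + patch_size] += 1
            idx += 1
    return total / np.maximum(count, 1)

labels = np.array([[0, 0], [1, 1]])           # two segments
clarity_x = np.array([[1.0, 1.0], [0.0, 0.0]])
clarity_y = np.array([[0.0, 0.0], [1.0, 1.0]])
take_x = choose_max_mean(labels, clarity_x, clarity_y)   # top row from X, bottom from Y
```

With the fused coefficients in hand, the fused image would be obtained as `patches_to_image(D @ P_F, image_shape, patch_size, step)`, using the same patch ordering as in the extraction step.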

### 3.2. Regional image fusion using SOMP algorithm

In this subsection, we introduce the second regional multifocus image fusion approach using sparse representation. In classical sparse representation, the OMP algorithm is usually used to obtain the sparse coefficients by solving the related non-convex optimization problem. Unlike OMP, SOMP uses the dictionary to decompose several image patches simultaneously. More specifically, for a given location, we extract one patch from each source image, and these patches are then reconstructed simultaneously from a common sparse set of atoms of the dictionary. Because the patches from different source images contain the same visual content at a given location, obtaining their sparse coefficients together is advantageous. Combining SOMP with the regional image fusion scheme of the previous subsection yields a new region-based multifocus image fusion method using the guided filter and greedy analysis [28]. The scheme of this method is illustrated in Figure 5.

There are also three stages in this scheme. In the first stage, the guided filter is adopted to enhance the details of the source images, and the sparse coefficients are then obtained using the SOMP algorithm. In this way, more accurate sparse coefficients are obtained from images whose details have been sharpened by the guided filter. In addition, the filter-enhanced edge information is introduced into the image to be segmented, which eventually improves the segmentation results. The remaining two stages are the same as those in the first proposed approach.

#### 3.2.1. Guided filter and image fusion

The guided filter was proposed by He et al. in 2010 [29]. For edge preservation, the guided filter has been shown to outperform the bilateral filter, which is also used for detail enhancement. Li et al. have proposed a guided filtering-based weighted averaging image fusion method using spatial consistency [30]. In general, the more feature information the processed images contain, the clearer the fused image that can be obtained. Figure 6 shows the process of guided filtering. Here, we introduce this filter only briefly.

Mathematically, the guided filter uses a local linear model as follows:

$$Q_{i} = s_{t} G_{i} + m_{t}, \quad \forall i \in w_{t} \tag{13}$$

where *Q* is the output of the guided filter, *G* is the image used to guide the filtering process, *w*_{t} is the sliding window, and (*s*_{t}, *m*_{t}) are coefficients assumed to be constant within *w*_{t}.

Taking the difference between the filter output and the filter input as the cost function and minimizing it, we obtain the best values of (*s*_{t}, *m*_{t}) from a simple linear regression problem as follows:

$$E(s_{t}, m_{t}) = \sum_{i \in w_{t}} \left( \left( s_{t} G_{i} + m_{t} - I_{i} \right)^{2} + \varepsilon s_{t}^{2} \right) \tag{14}$$

Here, *I* denotes the filter input, and the regularization parameter *ε* in Eq. (14) prevents *s*_{t} from becoming too large. With the optimized (*s*_{t}, *m*_{t}), we obtain the filtered image by averaging all the patches generated by Eq. (13). More details can be found in Ref. [29].
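A compact NumPy sketch of the guided filter is given below. It uses the well-known closed-form solution of the regularized linear regression, in which the slope is the local covariance of guide and input divided by the local variance of the guide plus ε, and the offset is the local input mean minus the slope times the local guide mean. The function names, the window radius `r`, the edge-replicating border handling, and the default `eps` are assumptions; `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, replicating edge pixels at the border."""
    padded = np.pad(img, r, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (2 * r + 1, 2 * r + 1))
    return windows.mean(axis=(-2, -1))

def guided_filter(guide, src, r=2, eps=1e-3):
    """Filter `src` under the guidance of `guide` with a local linear model."""
    guide = np.asarray(guide, dtype=float)
    src = np.asarray(src, dtype=float)
    mean_g = box_mean(guide, r)
    mean_i = box_mean(src, r)
    cov_gi = box_mean(guide * src, r) - mean_g * mean_i
    var_g = box_mean(guide * guide, r) - mean_g ** 2
    s = cov_gi / (var_g + eps)        # slope of the local linear model
    m = mean_i - s * mean_g           # offset of the local linear model
    # Average the coefficients of all windows covering each pixel.
    return box_mean(s, r) * guide + box_mean(m, r)

flat = np.full((6, 7), 0.4)
out = guided_filter(flat, flat)       # a constant image passes through unchanged
```

Using the source image as its own guide gives edge-preserving smoothing: in flat areas the slope tends to zero and the output approaches the local mean, whereas near strong edges the slope approaches one and the edge is kept.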

For the source images processed by the guided filter, we apply a sliding window to obtain a set of image patches. For the same location in each image, the corresponding patches are decomposed simultaneously over the same subset of atoms using SOMP. Then, following the same steps as in the previous subsection, we use the obtained coefficients to calculate the clarity measure, construct the clarity enhanced image, perform the image segmentation, and fuse the images region by region according to the segmentation result.

## 4. Experiments and results

In this section, we evaluate the performance of the proposed regional multifocus image fusion methods and compare them with the following four methods: multifocus image fusion using DWT [31], multifocus image fusion using guided filtering [30], multifocus image fusion using sparse representation [20], and regional multifocus image fusion using spatial frequency [10]. For simplicity, we refer to these methods as DWT, GF, SR, and RIFSF, respectively. The DWT-based and GF-based methods are pixel-level multifocus image fusion methods, whereas the SR-based and RIFSF-based methods are regional multifocus image fusion methods. The image fusion experiments are conducted in Matlab 2014b.

### 4.1. Data

The test images are obtained from Ref. [32]. Four pairs of source images, named “book,” “balloon,” “flower,” and “leopard,” are shown in Figure 7. The two images in each pair are focused at different depths.

### 4.2. Results

#### 4.2.1. Image segmentation results

For simplicity, we call the two fusion methods introduced in this chapter RIFOMP and RIFSOMP and compare them with the four other image fusion methods. Because fusion results depend strongly on the parameter settings, for a fair comparison we use the same parameter settings as in the original papers [10, 20, 30, 31]. In the proposed methods, the image patches are obtained with the sliding window technique, and the clarity enhanced image is segmented with normalized cuts; the corresponding segmentation results are shown in Figure 8. The results show that the in-focus and out-of-focus pixels are largely divided into different regions.

#### 4.2.2. Image fusion results

Our multifocus image fusion is conducted on the basis of the segmentation results, and the fusion results for the source images above are shown in Figures 9–12. The results of the other methods are also shown for comparison. These figures show that the fused images produced by the DWT-based method are not very clear. In addition, some fused images produced by the RIFSF-based method contain incorrect region selections; for example, in Figure 10(f), the boundary of the balloon is blurred because of this problem. The results of the other four methods are visually similar, so we further conduct a quantitative comparison using several image quality indexes.

### 4.3. Quantitative evaluation

In addition to the subjective evaluation, objective criteria are used to evaluate the image fusion results quantitatively. We adopt six popular performance criteria: the Petrovic metric (*Q*^{AB/F}) [33], mutual information (MI) [34], root mean square error (RMSE) [35], peak signal-to-noise ratio (PSNR) [36], structure similarity measure (SSIM) [37], and correlation coefficient (CC) [38].

Petrovic metric (*Q*^{AB/F}): *Q*^{AB/F} evaluates how much edge information is transferred from the source images to the fused image. Generally speaking, the larger the value of *Q*^{AB/F}, the better the fusion result. The value of *Q*^{AB/F} lies between 0 and 1.

Mutual information (MI): MI measures the dependence between the source images and the fused image. It is a good indicator of the information shared by the fused image and the source images; therefore, the higher the better.

Root mean square error (RMSE): RMSE measures the pixel-wise difference between the fused image and the source images. A better image fusion result has a smaller RMSE value.

Peak signal-to-noise ratio (PSNR): PSNR is widely used to measure the similarity of images (here, the source images and the fused image). The higher the value of PSNR, the better the fusion result.

Structure similarity measure (SSIM): SSIM measures the structural distortion between the source images and the fused image. The higher the value of SSIM, the lower the structural distortion and the better the fusion result.

Correlation coefficient (CC): CC indicates the degree of correlation between the source images and the fused image. If the value of CC approaches 1, the correlation between the source images and the fused image is very strong.
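Three of these criteria have simple closed forms, sketched below for a pair of images with intensities in [0, 1]. The function names and the peak value of 1.0 are assumptions, and the way scores against several source images are combined into one number (for example, by averaging) is not specified here and is left to the caller.

```python
import numpy as np

def rmse(a, b):
    """Root mean square error between two images of equal shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = rmse(a, b)
    return float("inf") if err == 0 else float(20.0 * np.log10(peak / err))

def correlation_coefficient(a, b):
    """Pearson correlation between the pixel intensities of two images."""
    a0 = np.asarray(a, dtype=float) - np.mean(a)
    b0 = np.asarray(b, dtype=float) - np.mean(b)
    return float((a0 * b0).sum() / np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum()))

reference = np.zeros((4, 4))
fused = np.full((4, 4), 0.1)
error = rmse(reference, fused)       # uniform offset of 0.1
quality = psnr(reference, fused)     # about 20 dB for peak = 1.0
```

Note that CC is undefined for a constant image, because its variance is zero; real images do not normally trigger this case.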

The comparison results for these six measures are shown in Table 1. The MI values of the two proposed methods are higher than those of the other methods for the source images “book,” “balloon,” and “flower.” For the source image “leopard,” the proposed RIFSOMP still achieves the highest MI. In terms of *Q*^{AB/F}, the two proposed methods also perform well; for example, RIFSOMP obtains the best *Q*^{AB/F} for the image “balloon.” Table 1 also lists the average performance of each method. On average, the two proposed regional methods are broadly comparable to the best of the other state-of-the-art methods: RIFSOMP attains the highest average MI and shares the highest average *Q*^{AB/F} with the guided filtering-based method, whereas SR leads on average RMSE, PSNR, SSIM, and CC. We conclude that the focus information is well preserved by our methods in the fused images, and there are no obvious artifacts in the fusion results.

| Source images | Quality measure | DWT | GF | SR | RIFSF | RIFOMP | RIFSOMP |
|---|---|---|---|---|---|---|---|
| Book | *Q*^{AB/F} | 0.7715 | 0.7985 | 0.8044 | 0.7902 | 0.8009 | 0.8010 |
| | MI | 7.1809 | 8.9691 | 8.2072 | 9.2945 | 9.7756 | 9.7775 |
| | RMSE | 0.0377 | 0.0117 | 0.0124 | 0.0119 | 0.0109 | 0.0110 |
| | PSNR | 27.7426 | 37.9276 | 37.4354 | 37.8021 | 38.4568 | 38.4451 |
| | SSIM | 0.9145 | 0.9143 | 0.9242 | 0.9148 | 0.9232 | 0.9229 |
| | CC | 0.9850 | 0.9840 | 0.9861 | 0.9843 | 0.9857 | 0.9854 |
| Balloon | *Q*^{AB/F} | 0.8133 | 0.8218 | 0.8160 | 0.8007 | 0.8208 | 0.8220 |
| | MI | 10.1277 | 11.1296 | 10.3557 | 11.1252 | 11.1355 | 11.1632 |
| | RMSE | 0.0114 | 0.0056 | 0.0055 | 0.0055 | 0.0055 | 0.0056 |
| | PSNR | 32.1260 | 38.3339 | 38.4714 | 38.4370 | 38.3358 | 38.2833 |
| | SSIM | 0.9684 | 0.9689 | 0.9716 | 0.9685 | 0.9693 | 0.9687 |
| | CC | 0.9914 | 0.9917 | 0.9923 | 0.9911 | 0.9918 | 0.9915 |
| Flower | *Q*^{AB/F} | 0.6817 | 0.7270 | 0.7224 | 0.6933 | 0.7238 | 0.7240 |
| | MI | 5.0983 | 7.2740 | 5.6288 | 7.6310 | 7.8754 | 8.0319 |
| | RMSE | 0.0304 | 0.0109 | 0.0094 | 0.0103 | 0.0108 | 0.0108 |
| | PSNR | 25.5105 | 34.4559 | 35.7465 | 34.8986 | 34.5014 | 34.4631 |
| | SSIM | 0.9041 | 0.8921 | 0.9352 | 0.9069 | 0.8922 | 0.8907 |
| | CC | 0.9407 | 0.9275 | 0.9565 | 0.9387 | 0.9267 | 0.9256 |
| Leopard | *Q*^{AB/F} | 0.8302 | 0.8356 | 0.8378 | 0.8275 | 0.8348 | 0.8357 |
| | MI | 9.8038 | 10.8384 | 10.0490 | 10.9832 | 10.9083 | 10.9901 |
| | RMSE | 0.0364 | 0.0175 | 0.0116 | 0.0177 | 0.0174 | 0.0176 |
| | PSNR | 28.0629 | 34.3990 | 38.0155 | 34.3415 | 34.4489 | 34.3644 |
| | SSIM | 0.9036 | 0.9045 | 0.9629 | 0.9040 | 0.9058 | 0.9038 |
| | CC | 0.9881 | 0.9882 | 0.9949 | 0.9881 | 0.9884 | 0.9881 |
| Average | *Q*^{AB/F} | 0.7741 | 0.7957 | 0.7952 | 0.7779 | 0.7951 | 0.7957 |
| | MI | 8.0527 | 9.5528 | 8.5602 | 9.7585 | 9.9237 | 9.9907 |
| | RMSE | 0.0290 | 0.0114 | 0.0097 | 0.0114 | 0.0112 | 0.0113 |
| | PSNR | 28.3605 | 36.2791 | 37.4172 | 36.3698 | 36.4357 | 36.3890 |
| | SSIM | 0.9227 | 0.9200 | 0.9485 | 0.9236 | 0.9226 | 0.9215 |
| | CC | 0.9763 | 0.9729 | 0.9825 | 0.9756 | 0.9732 | 0.9727 |

## 5. Conclusion and future work

As more advanced coding techniques appear, regional multifocus image fusion methods can be expected to produce clearer fused images. In this chapter, the general structure of regional multifocus image fusion is introduced, and regional multifocus image fusion methods using two different sparse representation algorithms, OMP and SOMP, are formulated. Experiments with the proposed regional methods show that regional fusion using sparse representation performs comparably to several state-of-the-art methods and better on some measures, such as MI.

To further improve the performance of regional image fusion, two directions can be explored. The first is to apply other coding techniques, such as group or structured sparse representations, to obtain more robust clarity measures. The second is to redesign the image segmentation approach: the focus information can be embedded directly in the segmentation procedure to obtain better partitions of the images for regional fusion. For example, the biased normalized cut [39] is a possible way to embed the focus information as the bias in the classical normalized cut segmentation algorithm.