Open access peer-reviewed chapter - ONLINE FIRST

Fusion of Color-Based Multi-Dimensional Scaling Maps For Saliency Estimation

Written By

Max Mignotte

Submitted: 08 June 2023 Reviewed: 31 July 2023 Published: 02 April 2024

DOI: 10.5772/intechopen.113077


From the Edited Volume

Digital Image Processing - Latest Advances and Applications [Working Title]

Dr. Francisco Javier Cuevas


Abstract

This work presents an original energy-based model, combining a pixel-pair modeling with a fusion procedure, for the saliency map estimation problem. More precisely, we formulate the saliency map segmentation problem as the solution of an energy-based model involving pixel pairwise constraints expressed in terms of color features, to which we then add higher-level constraints derived from a preliminary over-segmentation, exploiting both the location of its regions and their contour information. Finally, this segmentation-driven saliency measure is estimated in different color spaces, which are then combined, with an outlier rejection scheme, in order to take into account the specific properties of each of these color models. Experimental results show that the proposed algorithm is simple, efficient (performing favorably against state-of-the-art methods), and also perfectible.

Keywords

  • color spaces
  • energy-based model
  • fusion procedure
  • regions of interest
  • saliency map estimation
  • salient object detection
  • FastMap optimization
  • outlier rejection scheme
  • image partition
  • modeling with pixel pairs or constraints
  • tracking

1. Introduction

Saliency detection (SD) is generally defined as the detection and segmentation of the most important or interesting objects or visual elements in the image scene which immediately and naturally attract and hold the attention of the viewer.

Such an algorithm basically attempts to imitate the natural eye-fixation process of the human visual system (HVS), which has the amazing capability to rapidly detect and precisely localize the most noticeably visible and informative object or region within a (potentially very) cluttered image. In fact, this mechanism, which developed during the evolutionary process, allows humans (and most mammals) to quickly and efficiently analyze a scene and focus their attention on its important objects or regions with minimal allocated (visual) processing resources.

SD is a low-level image processing task which is important in a variety of vision or image processing applications, especially those where there is a need to reduce information overload. Such problems are commonly encountered in content-based image/video retrieval, summarization, categorization, compression or browsing, automatic resizing, image cropping, advertising design, or adaptive image display on small devices, to name a few.

A significant number of methods have used (with some variations) the same principle proposed by the pioneering work of Itti and Koch [1], based on local contrast concepts and modeling the fact that a salient object/region presents a strong contrast with respect to its spatial neighborhood [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Another class of algorithmic procedures is rather based on the hypothesis that the salient element is remarkably distinct; i.e., it exhibits a unique and discriminating color (or a mixture of such colors, possibly combined with textural features) compared to the rest of the image [6, 10, 13, 14, 15, 16, 17, 18, 19, 20].

Nevertheless, local contrast-based algorithms tend to produce over-estimated saliency values close to the edges and contours of objects while producing too low saliency values inside objects (instead of highlighting them uniformly), and are also affected by high-frequency spatial signals in the image, while techniques based on global contrast cannot differentiate regions precisely and are often altered by the presence of a cluttered background [21, 22].

That is why some SD models propose to combine local and global contrast-based techniques, either by combining local and global features [23], by combining local contrast cues via a multi-layer or multi-scale approach [19, 21] or a tree structure [18], or by basing their saliency estimation on a multi-level segmentation [17].

Other approaches are modeled within the conceptual framework of quantum mechanics [22], bag-of-features [24], or matrix decomposition [25], or have been developed in the regression (or, equivalently, energy-based) framework [17, 21, 26], sometimes guided by Gestalt laws [27]. Yet other approaches are based on graph theory [28] or on image statistics, possibly obtained in advance from a database of natural images [29], or are modeled within the Bayesian statistical framework using only the statistics of the input image [30, 31], or rely on a conditional random field (CRF) [23, 32, 33, 34, 35, 36], a Bayesian model [37], or a Markov random field (MRF) [38], generally by formulating the saliency estimation as a random walk problem [4, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49] or via an absorbing Markov chain [50], possibly guided by depth information [51].

Contrary to the above-mentioned strategies, our saliency model is based on the combination of three contributions:

  • The first one relies on a generic energy-based model involving pixel pairwise constraints (i.e., an original modeling by pairs of pixels), giving a first, rough saliency map (based on the color cue) whose solution is then efficiently and quickly obtained with a linear computational complexity (in terms of the number of pixels in the image).

  • The second one is a set of original and efficient constraints based on a preliminary over-segmentation, whose region locations and contour information are both exploited as constraints to improve the previous saliency map result.

  • Finally, the third contribution is the use of a simple but efficient fusion procedure combining the saliency result expressed in several color systems with a robust outlier rejection scheme.

The remainder of this paper is organized as follows: Section 2 describes the proposed model. Section 3 reports and evaluates the performances achieved by our model by comparing them with state-of-the-art methods, and then discusses the reliability, strengths, and interest of the proposed approach and the quantification (in terms of classification gain) of each component of the proposed model. Finally, Section 4 concludes the paper.


2. Proposed model for the saliency map estimation problem

2.1 Color-based multidimensional scaling saliency map

Our saliency model is first based on the hypothesis that the salient region or element to be detected is remarkably distinct; that is to say, it has discriminating colors compared to its surrounding spatial neighborhood. To this end, and in order to provide a rough but meaningful first soft saliency map x̂ (with values varying from 0 to 1, representing the saliency probability of each pixel in the sense of a given criterion), we search for the solution map x̂ that translates the mean-color distance between each pair of pixels in the input image into an identical gray-scale distance between the corresponding pair of pixels in the saliency map to be estimated (or, otherwise said, the map x̂ that preserves, as much as possible, the distance between each pair of pixels when going from the color input image to the gray-scale soft saliency map). Mathematically speaking, this can be efficiently done by searching for the map x̂ that minimizes the following energy-based or cost function (which is also commonly called the stress function):

$$\hat{x} \;=\; \arg\min_{x} \Bigl[ \sum_{i,j=1,\dots,N} \bigl( \|y_i - y_j\|_2 - \|x_i - x_j\|_2 \bigr)^2 \Bigr]^{1/2} \tag{1}$$

where N is the number of pixels of the input and saliency images, and the summation over i, j = 1, …, N runs over all pixel pairs in the image (i.e., for all pixels i and for all pairs of pixels involving i). xi corresponds to the gray level at pixel location i, and yi is a three-dimensional (3D) vector coding the three mean color channel values computed within a local square region of size Nc × Nc centered on the pixel at location i (of the input color image y). The notation ‖·‖2 refers to the Euclidean norm.
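
To make Eq. (1) concrete, the following minimal C++ sketch (illustrative only, not the author's implementation; the data layout and function names are assumptions) computes the mean-color features y over an Nc × Nc window and evaluates the stress of a candidate gray-scale map x:

```cpp
// Minimal sketch (not the author's code): mean-color features y of Eq. (1)
// and the stress value of a candidate gray-scale map x.
// Images are assumed to be flat, row-major float buffers.
#include <algorithm>
#include <cmath>
#include <vector>

// Mean color (3 channels) computed in an Nc x Nc window centered on each pixel.
std::vector<float> meanColorFeatures(const std::vector<float>& rgb, // size 3*W*H
                                     int W, int H, int Nc) {
    std::vector<float> y(3 * W * H, 0.0f);
    const int r = Nc / 2;
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j) {
            float sum[3] = {0.0f, 0.0f, 0.0f};
            int count = 0;
            for (int di = -r; di <= r; ++di)
                for (int dj = -r; dj <= r; ++dj) {
                    int ii = std::min(std::max(i + di, 0), H - 1); // clamped borders
                    int jj = std::min(std::max(j + dj, 0), W - 1);
                    for (int c = 0; c < 3; ++c) sum[c] += rgb[3 * (ii * W + jj) + c];
                    ++count;
                }
            for (int c = 0; c < 3; ++c) y[3 * (i * W + j) + c] = sum[c] / count;
        }
    return y;
}

// Stress of Eq. (1), up to a constant factor (each pair counted once).
// O(N^2): only useful to check a solution on small images; the minimization
// itself is delegated to FastMap (see Algorithm 1).
double stress(const std::vector<float>& y, const std::vector<float>& x, int N) {
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = i + 1; j < N; ++j) {
            double dy = 0.0;
            for (int c = 0; c < 3; ++c) {
                double d = y[3 * i + c] - y[3 * j + c];
                dy += d * d;
            }
            double diff = std::sqrt(dy) - std::fabs(x[i] - x[j]);
            s += diff * diff;
        }
    return std::sqrt(s);
}
```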

The stress function given in Eq. (1) is often minimized using procedures called stress majorization. This class of techniques acts as an efficient global minimizer of this loss function, which consists of a residual sum of squares integrating all the pairwise constraints. Conceptually, this kind of optimizer treats each distance or constraint between a pair of pixels (in terms of gray-level difference in our application) like a rubber band between the two pixels, and seeks to reorganize the gray-scale value of each pixel of x̂ so as to minimize the tension or stress of the rubber bands (hence the name stress measure) or, equivalently, to fulfill all constraints in a least-squares (LSQ) problem formulation. Multidimensional scaling (MDS) is the best-known stress majorization procedure allowing to minimize such a stress function with pairwise constraints, and this optimizer has proven to be particularly effective in its ability to solve certain image processing problems requiring a modeling by pairs of pixels (e.g., color image segmentation [52, 53], hyperspectral compression [54, 55], asymmetry detection [56], human action recognition [57], fusion procedures [58], histogram specification algorithms [59], high dynamic range (HDR) compression [60], multimodal change detection [61], and database browsing and visualization [62], to name a few).

However, the MDS algorithm as initially proposed (named metric MDS [63, 64]) is not ideally suited to our application (as well as to any large-scale application) due to its computational and memory load. Indeed, it requires an O(N²) computational complexity and an entire N × N distance matrix to be stored in memory. Instead, we have resorted to a faster (and near-optimal) variant here, named FastMap [65], whose major benefit lies in its linear computational complexity (through an efficient Nyström [66, 67] approximation1). In addition, it is worth mentioning that the efficiency of FastMap is all the more important as the dimension of the final mapping is low [67]. This is also why this optimizer is particularly well suited to our application, since the final (saliency) mapping has to be estimated at a very low dimension (i.e., one dimension, since we finally want to obtain a 1-channel gray-scale saliency map2).

Let us now recall that a configuration x̂ that minimizes Eq. (1) will give us a gray-scale map in which pixel pairs that are close in terms of mean color in the input image y will also be pixel pairs, in the saliency map x̂, with gray levels that are faithfully close (in the LSQ sense). Nevertheless, the configuration given by the FastMap technique may generate an image with an inverted gray scale (a negative of the image), i.e., with gray-level values all the lower as the regions are estimated as salient, and vice versa. In order to rearrange this gray-scale mapping in the appropriate direction (with correspondingly higher gray-scale values for the most salient regions), we simply assume that a salient element/region is a priori more likely to appear in the image center [17, 21, 26, 29] (or, conversely, unlikely at the edges of the image frame). To do this, we calculate the Pearson correlation coefficient between the saliency map x̂ estimated by FastMap and a rectangle, with maximum intensity value and about half the image size, located in the center of the image; more precisely, a rectangle beginning and ending respectively at rows lgth/4 and lgth − lgth/4 and at columns wdth/4 and wdth − wdth/4, where lgth and wdth represent the length and the width of the image. If the correlation coefficient is negative (anti-correlation), we then associate to each pixel (of x̂) its complementary gray value.
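
The following C++ sketch illustrates this orientation step (a minimal, hypothetical implementation assuming a row-major saliency map normalized to [0, 1]; it is not the author's code):

```cpp
// Minimal sketch: correlate the saliency map with a centered rectangle template
// and invert the map if the Pearson correlation coefficient is negative.
#include <cmath>
#include <vector>

void orientSaliencyMap(std::vector<float>& x, int lgth, int wdth) {
    // Template: 1 inside the central rectangle spanning rows [lgth/4, lgth - lgth/4]
    // and columns [wdth/4, wdth - wdth/4], 0 elsewhere.
    std::vector<float> t(lgth * wdth, 0.0f);
    for (int row = lgth / 4; row < lgth - lgth / 4; ++row)
        for (int col = wdth / 4; col < wdth - wdth / 4; ++col)
            t[row * wdth + col] = 1.0f;

    // Pearson correlation coefficient between x and the template t.
    const int N = lgth * wdth;
    double mx = 0.0, mt = 0.0;
    for (int i = 0; i < N; ++i) { mx += x[i]; mt += t[i]; }
    mx /= N; mt /= N;
    double cov = 0.0, vx = 0.0, vt = 0.0;
    for (int i = 0; i < N; ++i) {
        cov += (x[i] - mx) * (t[i] - mt);
        vx  += (x[i] - mx) * (x[i] - mx);
        vt  += (t[i] - mt) * (t[i] - mt);
    }
    double r = cov / (std::sqrt(vx * vt) + 1e-12);

    // Anti-correlation: take the complementary gray value of every pixel.
    if (r < 0.0)
        for (int i = 0; i < N; ++i) x[i] = 1.0f - x[i];
}
```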

2.2 Segmentation-based adaptive central location constraint

In order to improve the saliency map result x̂, we decided to incorporate an adaptive location constraint favoring the fact that salient objects are more likely to appear in the segmented regions located in the center of the image. To this end, we first exploit an over-segmentation of the input image and, more precisely, the concept of super-pixels, which has already been successfully used, in different ways, in several SD models [17, 18, 26, 30, 68, 69, 70]. An over-segmented region or super-pixel brings together a group of pixels (forming a perceptually meaningful atomic region that preserves the important structures in the image) which can be exploited to replace the rigid pixel grid structure of images. Herein, we use the super-pixel technique introduced by Felzenszwalb and Huttenlocher [71] with the settings (default values) proposed by the authors (see Figure 1c).

Figure 1.

Refinement of the MDS-based saliency map x̂ [see Eq. (1)] with the two segmentation-based constraints favoring the fact that salient objects are (1) more likely to appear in the segmented regions located in the center of the image and (2) more likely to exhibit a strong contour (see Section 2.2), in the RGB color space. (a) Input ECSSD image, (b) ground truth binary salient mask, (c) segmentation into regions given by the FH algorithm, (d) potential c favoring the central regions of the image, (e) potential c averaged within each segmented region, (f) relative strength (in terms of average gradient magnitude) of the boundary of each region, (g) potential c combined with the segmentation into regions and weighted by the relative strength of the boundary of each region, (h) initial saliency map, (i) saliency map combined with the map obtained at step (g).

In addition to the location of the different regions, this segmentation allows us to estimate the contour strength of each region relative to the other regions. In our application, this visual cue is simply expressed as the average gradient magnitude per pixel along the contour of the concerned region, divided by the average gradient magnitude (per pixel) over all the contour points given by the segmentation (thus giving a positive number ξ that is all the greater as the region is prominent in the sense of the strength of its outline). This visual cue allows us to incorporate in our model the fact that a salient region is more likely to appear with a strong contour (see Figure 1f).
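
A minimal sketch of how this relative contour strength ξ can be computed is given below; it assumes a precomputed gradient-magnitude map and the integer label map given by the over-segmentation (the function name and data layout are illustrative, not the author's):

```cpp
// Minimal sketch: relative contour strength xi of each region, i.e., the average
// gradient magnitude per boundary pixel of a region divided by the average over
// all boundary pixels of the segmentation.
#include <vector>

std::vector<double> contourStrength(const std::vector<float>& gradMag, // |grad| per pixel
                                    const std::vector<int>& label,     // region label per pixel
                                    int W, int H, int nbRegions) {
    std::vector<double> sum(nbRegions, 0.0);
    std::vector<int>    cnt(nbRegions, 0);
    double globalSum = 0.0; int globalCnt = 0;
    for (int i = 0; i < H; ++i)
        for (int j = 0; j < W; ++j) {
            int idx = i * W + j, l = label[idx];
            // A pixel is a contour point if one of its 4-neighbours has another label.
            bool boundary =
                (i > 0     && label[idx - W] != l) || (i < H - 1 && label[idx + W] != l) ||
                (j > 0     && label[idx - 1] != l) || (j < W - 1 && label[idx + 1] != l);
            if (!boundary) continue;
            sum[l] += gradMag[idx]; ++cnt[l];
            globalSum += gradMag[idx]; ++globalCnt;
        }
    double globalMean = globalCnt ? globalSum / globalCnt : 1.0;
    std::vector<double> xi(nbRegions, 1.0);
    for (int r = 0; r < nbRegions; ++r)
        if (cnt[r]) xi[r] = (sum[r] / cnt[r]) / globalMean;
    return xi;
}
```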

Finally, in our model, the set of pixels belonging to each segmented region Ri is assigned the mean, over this region, of the initial saliency measure x̂ multiplied by the potential favoring salient regions in the center of the image (see Figure 1e), finally weighted by ξRi, the relative strength of the boundary of Ri (Figure 1f):

$$\text{saliency}(R_i) \;=\; \xi_{R_i}\,\frac{1}{|R_i|} \sum_{s \in R_i} \hat{x}_s\, c_s \tag{2}$$

with cs the potential at site s, of coordinates (row, col), favoring salient regions in the center of the image (and disfavoring them at the edges of the image), defined in the following way:

$$c_{(\text{row},\,\text{col})} \;=\; 2 - 2\,\max\!\left( \frac{|\text{row} - \text{lgth}/2|}{\text{lgth}/2},\; \frac{|\text{col} - \text{wdth}/2|}{\text{wdth}/2} \right) \tag{3}$$

where lgth and wdth represent the length and the width of the image, and c(row,col) is the gray-scale value of the map c at coordinates (row, col). The refinement of the MDS-based saliency map x̂ with these two constraints (see Figure 1i) is denoted x̂c in the following. One of the undeniable advantages of these two types of constraints is that neither depends on any internal parameter to be adjusted or tuned afterwards (thanks to their inherent adaptation to each image). The effectiveness of these two constraints will be quantified individually in Section 3.3.
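
The following sketch summarizes this segmentation-driven refinement; it assumes the form of the central potential given in Eq. (3) (maximal at the image center, vanishing at the borders) and illustrative data structures, and is not the author's code:

```cpp
// Minimal sketch of the refinement of Eqs. (2)-(3): central-location potential
// and region-wise aggregation of the initial MDS saliency map.
#include <algorithm>
#include <cmath>
#include <vector>

// Central-location potential c(row,col) of Eq. (3): maximal at the image center,
// decreasing toward the image borders.
double centralPotential(int row, int col, int lgth, int wdth) {
    double dr = std::fabs(row - lgth / 2.0) / (lgth / 2.0);
    double dc = std::fabs(col - wdth / 2.0) / (wdth / 2.0);
    return 2.0 - 2.0 * std::max(dr, dc);
}

// Eq. (2): each region Ri receives the mean of x̂_s * c_s over its pixels,
// weighted by its relative contour strength xi[Ri].
std::vector<float> refineSaliency(const std::vector<float>& xhat,   // initial MDS saliency
                                  const std::vector<int>& label,    // FH over-segmentation
                                  const std::vector<double>& xi,    // contour strengths
                                  int lgth, int wdth, int nbRegions) {
    std::vector<double> acc(nbRegions, 0.0);
    std::vector<int>    cnt(nbRegions, 0);
    for (int row = 0; row < lgth; ++row)
        for (int col = 0; col < wdth; ++col) {
            int s = row * wdth + col;
            acc[label[s]] += xhat[s] * centralPotential(row, col, lgth, wdth);
            ++cnt[label[s]];
        }
    std::vector<float> out(lgth * wdth, 0.0f);
    for (int s = 0; s < lgth * wdth; ++s) {
        int r = label[s];
        out[s] = static_cast<float>(xi[r] * acc[r] / std::max(cnt[r], 1));
    }
    return out;
}
```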

Algorithm 1: FastMap

Input:

k: Target space dimension

Np: Number of objects (vectors) in database O

D(·,·): Distance function between two objects of O

Output:

X[Np][k]: Array of objects in target space

Initialization:

d ← 0

FASTMAP(k, D(·,·), O)

  • if k ≤ 0 then return X

  • d ← d + 1

  • Choose pivot objects Oa and Ob such that the distance D(Oa, Ob) is maximal

foreach object Oi of O do

  • Project Oi on the line (Oa, Ob) and store X[i][d] = xi, computed with the cosine law:

     xi = ( D²(Oa, Oi) + D²(Oa, Ob) − D²(Ob, Oi) ) / ( 2 D(Oa, Ob) )

end

foreach pair of objects (Oi, Oj) of O do

  • Project on the hyper-plane perpendicular to the line (Oa, Ob), i.e., update the distance:

     D′²(Oi, Oj) = D²(Oi, Oj) − (xi − xj)²

end

call FASTMAP(k − 1, D′, O)
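
Since the target (saliency) space is one-dimensional in our setting, a single projection pass of Algorithm 1 suffices. The sketch below illustrates this case for Euclidean distances between the mean-color features (an illustrative reconstruction, not the author's released code; the pivot selection follows the spirit of the choose-distant-objects heuristic of [65], run for a few iterations):

```cpp
// Minimal sketch: one FastMap projection pass giving a 1-D (gray-scale) embedding.
#include <cmath>
#include <utility>
#include <vector>

static double dist(const std::vector<float>& y, int i, int j) {
    double d = 0.0;
    for (int c = 0; c < 3; ++c) {
        double v = y[3 * i + c] - y[3 * j + c];
        d += v * v;
    }
    return std::sqrt(d);
}

std::vector<float> fastMap1D(const std::vector<float>& y, int N) {
    // choose-distant-objects heuristic: start anywhere, jump to the farthest
    // object, repeat a few times to obtain two mutually distant pivots a and b.
    int a = 0, b = 0;
    for (int it = 0; it < 3; ++it) {
        double best = -1.0;
        for (int i = 0; i < N; ++i)
            if (dist(y, a, i) > best) { best = dist(y, a, i); b = i; }
        std::swap(a, b);
    }
    double dab = dist(y, a, b);
    std::vector<float> x(N, 0.0f);
    if (dab <= 0.0) return x;   // degenerate case: all features identical
    // Projection of every pixel on the pivot line (cosine law, cf. Algorithm 1).
    for (int i = 0; i < N; ++i) {
        double dai = dist(y, a, i), dbi = dist(y, b, i);
        x[i] = static_cast<float>((dai * dai + dab * dab - dbi * dbi) / (2.0 * dab));
    }
    return x;
}
```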

2.3 Fusion in different color spaces

In order to further improve our final saliency estimation, the segmentation-driven saliency map obtained by the two previous steps (cf. Sections 2.1 and 2.2) is estimated in different and non-linearly related color spaces, which are then combined with a fusion method. Let us note that this strategy has already proven its efficiency in improving the accuracy of an image segmentation with different fusion methods, each optimal in the sense of a particular criterion [58, 72, 73]. Let us recall that a color space is just an arbitrary model that combines three numbers (or tristimulus values) to describe a sensation of color that our visual system can perceive. In a fusion model combining several color systems, each color space can in fact be regarded as a different image channel provided by a different sensor, or as a different complementary filter. In our application, we use Ns saliency maps provided by the Ns = 8 following non-linearly related color spaces: C = {LUV, HSV, LAB, RGB, YIQ, XYZ, HSL, TSL}. Each color space has a specific and interesting property3 [74, 75, 76, 77, 78, 79], which can efficiently be taken into account and makes the fusion process very efficient (its effectiveness will be quantified in Section 3.3).

In our application, the fusion model simply averages the Ns − 1 most reliable saliency maps achieved in these different color spaces (after their individual normalization so that all values lie between 0 and 255). In order to eliminate the least reliable saliency map, identified in our fusion model as an isolated outlier, we simply select the one that is least correlated with all the others in the Pearson correlation sense (see Figure 2). This fusion strategy will be quantified in terms of F-measure gain in Section 3.3.
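
A minimal sketch of this fusion step is given below; it assumes the Ns maps have already been normalized and share the same size (names and data layout are illustrative, not the author's):

```cpp
// Minimal sketch: average the Ns-1 maps remaining after discarding the map that
// is least correlated (on average, in the Pearson sense) with all the others.
#include <cmath>
#include <vector>

static double pearson(const std::vector<float>& a, const std::vector<float>& b) {
    const size_t N = a.size();
    double ma = 0.0, mb = 0.0;
    for (size_t i = 0; i < N; ++i) { ma += a[i]; mb += b[i]; }
    ma /= N; mb /= N;
    double cov = 0.0, va = 0.0, vb = 0.0;
    for (size_t i = 0; i < N; ++i) {
        cov += (a[i] - ma) * (b[i] - mb);
        va  += (a[i] - ma) * (a[i] - ma);
        vb  += (b[i] - mb) * (b[i] - mb);
    }
    return cov / (std::sqrt(va * vb) + 1e-12);
}

// Assumes at least two maps, all of identical size.
std::vector<float> fuseSaliencyMaps(const std::vector<std::vector<float>>& maps) {
    const size_t Ns = maps.size(), N = maps[0].size();
    size_t outlier = 0;
    double worst = 1e30;
    for (size_t k = 0; k < Ns; ++k) {
        double sum = 0.0;
        for (size_t l = 0; l < Ns; ++l)
            if (l != k) sum += pearson(maps[k], maps[l]);
        if (sum < worst) { worst = sum; outlier = k; }   // least correlated map
    }
    std::vector<float> fused(N, 0.0f);
    for (size_t k = 0; k < Ns; ++k) {
        if (k == outlier) continue;
        for (size_t i = 0; i < N; ++i) fused[i] += maps[k][i];
    }
    for (size_t i = 0; i < N; ++i) fused[i] /= static_cast<float>(Ns - 1);
    return fused;
}
```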

Figure 2.

Saliency maps x̂c obtained by the two previous steps (cf. Sections 2.1 and 2.2), estimated respectively in the (a) RGB, (b) HSV, (c) LAB, (d) LUV, (e) YIQ, (f) XYZ, (g) HSL, and (h) TSL color spaces, and (i) final saliency map obtained by our fusion model [averaging of the seven most reliable saliency maps; for this ECSSD image, the outlier that is automatically eliminated is the saliency map estimated in the TSL (h) color space].


3. Experimental results

3.1 Set-up and dataset description

First, to limit the computational load, we decided to resize all the images so that the maximum of the (height, width) dimensions is 200 pixels. In the following, all parameters are defined relative to this new base image size.

The only internal parameter of our model is the size Nc of the square window (introduced in the initial MDS-based saliency map, see Eq. (1)). In our experiments, we tuned this internal parameter in order to maximize the F-measure on a subset of 10 images randomly extracted from the ECSSD dataset. This was done by applying a fixed-step grid search over the range of possible parameter values (namely Nc ∈ [3, 25], step size 2). We found Nc = 19, and a relative insensitivity of our model when this size parameter is held within plus or minus 30% of this value. We recall that the internal parameters of the super-pixel algorithm used in our model are set to the default values suggested by the authors [71]. In addition, it is worth recalling that the two segmentation-based constraints applied to the initial MDS-based saliency map do not depend on any internal parameter (since these two constraints adapt to each image, see Section 2.2).
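
This grid search can be sketched as follows (illustrative; scoreOnTuningSubset is a hypothetical callback returning the F-measure obtained on the 10-image tuning subset for a given window size):

```cpp
// Minimal sketch of the fixed-step grid search used to tune Nc.
#include <functional>

int tuneNc(const std::function<double(int)>& scoreOnTuningSubset) {
    int bestNc = 3;
    double bestF = -1.0;
    for (int Nc = 3; Nc <= 25; Nc += 2) {        // range [3, 25], step size 2
        double F = scoreOnTuningSubset(Nc);      // F-measure on the tuning subset
        if (F > bestF) { bestF = F; bestNc = Nc; }
    }
    return bestNc;   // Nc = 19 was retained in the experiments reported here
}
```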

We have validated our algorithm on the extended complex scene saliency dataset (ECSSD) (see Figures 3 and 4), which was built by Yan et al. [21] and Shi et al. [80]. This database is composed of 1000 images containing various categories (natural or man-made objects) and different backgrounds (sometimes non-uniform, with possible changes in illumination intensity, and/or composed of several parts or including small-scale structures). In addition, multiple salient objects may exist in one image, while part of or all of them are sometimes transparent (or do not have a sharply clear boundary and/or an obvious difference with the background) and are nevertheless regarded as salient, as decided by an expert. The binary masks or ground truths have been generated for each image with a well-defined and reliable protocol (more precisely, for each image, several experts manually drew a saliency map, and these maps were finally combined into one ground truth binary mask in the majority vote sense; see [80] for additional details). We have also tested our method on the 5000 images of the commonly used MSRA-5000 (or MSRA-5K) saliency dataset [23]. Images of MSRA-5000 are relatively less complex (with less variability) and are therefore rightly considered less challenging to process than those of the ECSSD image database [80]. Examples of visual comparison between our model and the state-of-the-art (SOTA) CHS saliency model presented in [80] (by the designers of the ECSSD dataset) on the first eight images of this image database are shown in Figure 4.

Figure 3.

In lexicographic order, distribution of respectively the Fβ, F, and mean absolute error (MAE) measures given by our MDSSME (multidimensional scaling map-based saliency map estimation) model on the 1000 images of the ECSSD dataset.

Figure 4.

Visual comparison between our model and the SOTA CHS saliency model presented in [80] on the first eight images of the ECSSD dataset. From left to right, ECSSD image, ground truth salient mask, CHS saliency map [80], our MDSSME saliency result.

3.2 Quantitative measure

Our quantitative assessments and experiments follow the framework presented in [10, 14, 68, 80]. First, the precision-recall curve associated with the set of saliency maps achieved by our model is plotted and then compared with other methods (see Figure 5a). In addition, since in many applications high precision is preferable to high recall, we also estimate the Fβ-measure proposed in [68, 80], as a function of each possible threshold:

Figure 5.

Quantitative comparison on ECSSD. From top to bottom: precision-recall curve for each possible threshold (within the range [0, 255]) and Fβ-measure [see Eq. (4) and [80]] with β² = 0.3 as a function of the threshold.

$$F_\beta \;=\; \frac{(1+\beta^2)\;\text{precision}\times\text{recall}}{\beta^2\,\text{precision} + \text{recall}} \tag{4}$$

in which thresholding (within the range [0, 255]) is applied, β² is set to 0.3 as suggested in [10, 68, 80]4, and the obtained Fβ-measure is shown and compared to other methods (see Figure 5b).

Besides, we studied, as a performance metric, the mean absolute error (MAE) [68, 80], which actually measures the quality of the weighted continuous saliency map (which could turn out to be, for some applications, more important information than the binary mask itself). Mathematically, the MAE measures the mean absolute error between the soft saliency map x and the binary ground truth xG (both normalized to the interval [0, 1]). The MAE metric is given by:

$$\text{MAE} \;=\; \frac{1}{\text{lgth}\times\text{wdth}} \sum_{i=1}^{\text{lgth}} \sum_{j=1}^{\text{wdth}} \bigl| x(i,j) - x_G(i,j) \bigr| \tag{5}$$

with i and j designating respectively the row and column coordinates of x or xG.
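
The two measures can be sketched as follows (illustrative code, assuming the saliency map and ground truth are stored as flat vectors normalized to [0, 1]):

```cpp
// Minimal sketch of the two evaluation measures used in this section.
#include <cmath>
#include <vector>

// MAE of Eq. (5).
double mae(const std::vector<float>& x, const std::vector<float>& xG) {
    double s = 0.0;
    for (size_t i = 0; i < x.size(); ++i) s += std::fabs(x[i] - xG[i]);
    return s / x.size();
}

// F-beta of Eq. (4) for one binarization threshold t in [0, 1] (beta^2 = 0.3).
double fBeta(const std::vector<float>& x, const std::vector<float>& xG,
             float t, double beta2 = 0.3) {
    double tp = 0.0, fp = 0.0, fn = 0.0;
    for (size_t i = 0; i < x.size(); ++i) {
        bool pred  = x[i]  >= t;       // binarized saliency map
        bool truth = xG[i] >= 0.5f;    // binary ground truth
        if (pred && truth) ++tp;
        else if (pred && !truth) ++fp;
        else if (!pred && truth) ++fn;
    }
    double precision = tp / (tp + fp + 1e-12);
    double recall    = tp / (tp + fn + 1e-12);
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-12);
}
```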

The precision-recall curve, the Fβ-measure as a function of the threshold, and the MAE distance are complementary measures and help interpret the global or local behavior (for a given or optimal threshold) of a SD model.

3.3 Results and discussion

First of all, we have tested (see Figure 6) the efficiency (in terms of precision-recall curves) of the main characteristics of our MDSSME model on the ECSSD dataset. More precisely, Figure 6 shows the precision-recall curves obtained by our model for, respectively, (1) for the first four curves, a fusion process (see Section 2.3) involving one, two, four, or eight color spaces, thus quantifying the efficiency of our fusion strategy through different color spaces (the other characteristics of our model remaining optimal and unchanged); and (2) for the last three curves, our model without the a priori constraint that a salient region is more likely to appear with a strong contour (gradient prior), without the a priori constraint of the central location (see Section 2.2), or without any constraint, i.e., without the two previous constraints and the over-segmentation process (just the MDS estimation expressed and averaged over the eight color spaces).

Figure 6.

Precision-recall curves (and optimal F and Fβ-measures [see Eq. (4)]) for different variations of our MDSSME model on ECSSD [80] dataset.

We can notice that the fusion process through different color spaces is efficient and allows an interesting gain of approximately 12% (on average) on the results, but beyond four color spaces the improvement becomes negligible. Let us recall that the color spaces must be chosen to be different and as uncorrelated as possible; this guarantees over-segmentations (and the sets of constraints based on them) that are different but complementary. We also notice that the gradient-based constraint brings a slight 3% gain to the full model. Nevertheless, this constraint (combined with the over-segmentation constraint) remains interesting when the commonly used central location constraint is not considered.

We have first evaluated our SD model on the ECSSD dataset and compared our results with local methods, IT [2], GB [4], and AC [8], and global methods, LC [13], SR [6], FT [10], HC [14], RC [14], CA [15], LR [16], RCC [19], and CHS [80]. Figure 5 shows the precision-recall curves and the Fβ curve [see Eq. (4)] as a function of the threshold, and Table 1 lists the MAE distances achieved by these different SD methods compared to our model on the ECSSD dataset.

Method      ECSSD    MSRA-5000
AC [8]      0.264    0.228
CA [15]     0.310    0.250
FT [10]     0.270    0.230
GB [4]      0.282    0.243
HC [14]     0.326    0.239
IT [2]      0.290    0.248
LC [13]     0.294    0.245
LR [16]     0.267    0.215
SR [6]      0.264    0.225
RC [14]     0.301    0.264
RCC [19]    0.187    0.140
HS [80]     0.224    0.153
CHS [80]    0.227    0.150
MDSSME      0.265    0.218

Table 1.

Quantitative comparison with [2, 4, 6, 8, 10, 13, 14, 15, 16, 19, 80] in terms of MAE measurement on the ECSSD [80] (first column) and MSRA-5000 (second column) databases.

For the MSRA-5K dataset, the optimal Fβ measure we can achieve is Fβ = 0.791 (for the particular threshold 136), F = 0.786 (β = 1, for the threshold 106), and MAE = 0.218 (see Table 1).

Overall, we obtain a competitive precision-recall or Fβ-measure curve with a correct MAE when compared with existing state-of-the-art algorithms. Nevertheless, in terms of performance, the strong point of our method remains the very competitive optimal Fβ-measure (for a given optimal threshold), which is reflected graphically by the fact that the Fβ curve associated with our model reaches a maximum above all other curves (see Figure 5b), with a plateau wide enough for an estimation of the optimal threshold to be possible (this plateau is also centered around the mean value of the gray-scale range, i.e., between [100, 170]).

We have also compared our model with two salient detection energy-based models very recently published in the literature [20, 38]. These two energy-based models are based, for the first, on LTP (local ternary pattern) texture-based features [20] and, for the second, on pairwise pixel-based features [38] to capture the notion of saliency in a scene (and both of these energy-based models operate within a single specific color space). We have compared our model on the ECSSD and MSRA-5K datasets in terms of the optimal Fβ and MAE efficiency measures (see Table 2), which shows that our model is very competitive in terms of these two highly complementary efficiency measures.

          PPMRF [38]                LTPSD [20]                MDSSME
ECSSD     Fβ = 0.727, MAE = 0.150   Fβ = 0.729, MAE = 0.257   Fβ = 0.734, MAE = 0.265
MSRA      Fβ = 0.790, MAE = 0.108   Fβ = 0.781, MAE = 0.215   Fβ = 0.791, MAE = 0.218

Table 2.

Quantitative comparison with two recent salient detection energy-based models [20, 38] in terms of MAE and Fβ measurements on the ECSSD (first line) and MSRA (second line) databases.

Our current implementation takes on average 0.6 seconds to process one image of resolution 400 × 300 on an 8 GB, 3.33 GHz Intel i7 CPU (6675.25 bogomips) with un-optimized C++ code running under Linux (or about 10 minutes for the processing of the entire ECSSD dataset composed of 1000 images). In the proposed algorithm, each MDS map is calculated very quickly, in approximately 0.05 seconds, due to the linear computational complexity (in terms of the number of pixels) of the FastMap algorithm, and this, we recall, despite the fact that this algorithm models and takes into account a quadratic number of pairs of pixels [see Eq. (1)]. Let us also mention that our SD model can be easily parallelized using an OpenMP implementation on several CPU cores (each MDS map estimated on one CPU core) or, even better, using a GPU implementation of the FastMap algorithm as proposed in [81], in order to reduce the computation time by a factor of 1000 (on a mid-range GPU card) and to allow the processing of the entire ECSSD dataset in less than 1 second.
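
In its simplest form, the parallelization over color spaces mentioned above reduces to one OpenMP directive, as sketched below (a sketch, assuming a caller-supplied per-color-space estimator; compile with -fopenmp):

```cpp
// Minimal sketch: estimate the Ns per-color-space saliency maps in parallel,
// one color space per thread/core.
#include <functional>
#include <vector>

std::vector<std::vector<float>>
computeAllMaps(int Ns, const std::function<std::vector<float>(int)>& estimateForSpace) {
    std::vector<std::vector<float>> maps(Ns);
    // The Ns MDS-based saliency maps are independent of each other.
    #pragma omp parallel for
    for (int k = 0; k < Ns; ++k)
        maps[k] = estimateForSpace(k);   // hypothetical per-color-space estimator
    return maps;
}
```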

Let us note that, in addition to being perfectible (by adding more color spaces) and possibly faster (by using GPU programming), our unsupervised energy-based model also has the property of depending on very few parameters. This defines a low-complexity model, ensuring no over-fitting and good generalization properties; consequently, our algorithm has the advantage of performing well on other datasets.

In addition, and to complement what has been stated previously, since our algorithm exploits the notion of color in different (mutually non-linearly related) color spaces, it is perfectible by combining (or fusing) our model with those using, as features, LTP textures [20] or the notion of pixel pairs [38] (or possibly other features).


4. Conclusion

In this paper, we have proposed a novel and simple saliency measure and segmentation method. The proposed method is based on an energy-based model involving a pixel-pair modeling using color features and segmentation-based constraints, finally combined with a fusion procedure taking into account the specific properties and the complementarity of several color space models. The proposed model has very few internal parameters and is designed to be highly adaptive to the image content. Qualitative and quantitative results show that the proposed method is effective and performs particularly well against state-of-the-art methods, especially in terms of optimal F-measure (for a given optimal threshold). In addition to its efficiency, the proposed model achieves a good compromise between simplicity and accuracy while being fast enough, easily parallelizable (in different ways), and also perfectible if more color spaces or more constraints are added to the fusion procedure.


Acknowledgments

This research was funded by individual discovery grant RGPIN-2022-03654. The data supporting the results of this study are freely available at:

http://www.iro.umontreal.ca/~mignotte/ResearchMaterial/index.html

The author wishes to acknowledge the NSERC (Natural Sciences and Engineering Research Council of Canada) for having funded this study under the individual discovery grant program (RGPIN-2022-03654). The author certifies no potential conflicts of interest influencing this study.


Data Availability Statement

The data (C++ code under Linux OS, makefile, image databases, etc.) supporting the results of this article are freely accessible at:

http://www.iro.umontreal.ca/∼mignotte/ResearchMaterial/index.html.

References

  1. 1. Itti L, Koch C. Computational modelling of visual attention. Nature Reviews Neuroscience. 2001;2(3):194-203
  2. 2. Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20(11):1254-1259
  3. 3. Ma Y-F, Zhang H-J. Contrast-based image attention analysis by using fuzzy growing. In: Proceedings of the Eleventh ACM International Conference on Multimedia, ser. MULTIMEDIA ‘03. New York, NY, USA: ACM. 2003. pp. 374–381
  4. 4. Harel J, Koch CC, Perona P. Graph-based visual saliency. In: Proceedings of the 19th International Conference on Neural Information Processing Systems. ser. NIPS’06. Cambridge, MA, USA: MIT Press. 2006. pp. 545–552
  5. 5. Harel J, Koch C, Perona P. Graph-based visual saliency. In: Schölkopf B, Platt JC, Hoffman T, editors. Advances in Neural Information Processing Systems 19. Vancouver, British Columbia, Canada: MIT Press; 2007. pp. 545–552
  6. 6. Hou X, Zhang L. Saliency detection: A spectral residual approach. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE. 2007, pp. 1–8
  7. 7. Achanta RR, Estrada F, Wils P, Susstrunk S. Salient region detection and segmentation. Gasteratos A, Vincze M, Tsotsos JK, editors. In: Computer Vision Systems. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. pp. 66–75
  8. 8. Achanta R, Estrada F, Wils P, Susstrunk S. Salient region detection and segmentation. In: Proceedings of the 6th International Conference on Computer Vision Systems, ser. ICVS’08. Berlin, Heidelberg: Springer-Verlag; 2008. pp. 66–75
  9. 9. Guo C, Ma Q, Zhang L. Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform. In: IEEE, Conf. Computer Vision and Pattern Recognition. 2008
  10. 10. Achanta R, Hemami SS, Estrada FJ, Susstrunk S. Frequency-tuned salient region detection. In: CVPR. IEEE Computer Society. 2009. pp. 1597–1604
  11. 11. Klein D, Frintrop S. Center-surround divergence of feature statistics for salient object detection. In: IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011. 2011. pp. 2214–2219
  12. 12. Zhang J, Ma S, Sameki M, Sclaroff S, Betke M, Lin Z, et al. Salient object subitizing. International Journal of Computer Vision. 2017;124(2):169-186
  13. 13. Zhai Y, Shah M. Visual attention detection in video sequences using spatiotemporal cues. In: Proceedings of the 14th ACM International Conference on Multimedia, ser. MM ‘06. New York, NY, USA: ACM. 2006. pp. 815–824
  14. 14. Cheng M-M, Zhang G-X, Mitra NJ, Huang X, Hu S-M. Global contrast based salient region detection. In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ‘11. Washington, DC, USA: IEEE Computer Society. 2011. pp. 409–416
  15. 15. Tal A, Zelnik-Manor L, Goferman S. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2012;34(10):1915-1926
  16. 16. Shen X, Wu Y. A unified approach to salient object detection via low rank matrix recovery. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), ser. CVPR ‘12. Washington, DC, USA: IEEE Computer Society. 2012. pp. 853–860
  17. 17. Jiang H, Wang J, Yuan Z, Wu Y, Zheng N, Li S. Salient object detection: A discriminative regional feature integration approach. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2013
  18. 18. Liu Z, Zou W, Le Meur O. Saliency tree: A novel saliency detection framework. IEEE Transactions on Image Processing. 2014;23(5):1937-1952
  19. 19. Cheng M, Mitra NJ, Huang X, Torr PHS, Hu S. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015;37(3):569-582
  20. 20. Ndayikengurukiye D, Mignotte M. Salient object detection by ltp texture characterization on opposing color pairs under slico superpixel constraint. Journal Imaging (JI), Special Issue: Advances in Color Imaging. 2022;8(4):11. Available from: https://www.mdpi.com/2313-433X/8/4/110
  21. 21. Yan Q, Xu L, Shi J, Jia J. Hierarchical saliency detection. In: Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ‘13. Washington, DC, USA: IEEE Computer Society. 2013. pp. 1155–1162
  22. 22. Aytekin C, Kiranyaz S, Gabbouj M. Automatic object segmentation by quantum cuts. In: 2014 22nd International Conference on Pattern Recognition. Aug 2014, pp. 112–117
  23. 23. Liu T, Yuan Z, Sun J, Wang J, Zheng N, Tang X, et al. Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2011;33(2):353-367
  24. 24. Huang F, Qi J, Lu H, Zhang L, Ruan X. Salient object detection via multiple instance learning. IEEE Transactions on Image Processing. 2017;26(4):1911-1922
  25. 25. Peng H, Li B, Ling H, Hu W, Xiong W, Maybank SJ. Salient object detection via structured matrix decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(4):818-832
  26. 26. Zhu W, Liang S, Wei Y, Sun J. Saliency optimization from robust background detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2014
  27. 27. Yan Y, Ren J, Sun G, Zhao H, Han J, Li X, et al. Unsupervised image saliency detection with gestalt-laws guided optimization and visual attention based refinement. Pattern Recognition. 2018;79:65-78
  28. 28. Ye L, Liu Z, Li L, Shen L, Bai C, Wang Y. Salient object segmentation via effective integration of saliency and objectness. IEEE Transactions on Multimedia. 2017;19(8):1742-1756
  29. 29. Zhang L, Tong MH, Marks TK, Shan H, Cottrell GW. Sun: A bayesian framework for saliency using natural statistics. Journal of Vision. 2008;8(7):32.1-20. Available from: https://jov.arvojournals.org/article.aspx?articleid=2297284
  30. 30. Li X, Lu H, Zhang L, Ruan X, Yang M-H. Saliency detection via dense and sparse reconstruction. In: The IEEE International Conference on Computer Vision (ICCV). December 2013
  31. 31. Xie Y, Lu H, Yang M. Bayesian saliency via low and mid level cues. IEEE Transactions on Image Processing. 2013;22(5):1689-1698
  32. 32. Mai L, Niu Y, Liu F. Saliency aggregation: A data-driven approach. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2013
  33. 33. Rahtu E, Kannala J, Salo M, Heikkilä J. Segmenting salient objects from images and videos. In: Daniilidis K, Maragos P, Paragios N, editors. Computer Vision – ECCV 2010. Berlin, Heidelberg: Springer Berlin Heidelberg; 2010. pp. 366-379
  34. 34. Fu K, Gu IY, Yang J. Saliency detection by fully learning a continuous conditional random field. IEEE Transactions on Multimedia. 2017;19(7):1531-1544
  35. 35. Qiu W, Gao X, Han B. A superpixel-based crf saliency detection approach. Neurocomputing. 2017;244:19-32
  36. 36. Yang J, Yang M. Top-down visual saliency via joint crf and dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(3):576-588
  37. 37. Feng L, Li H, Cheng D, Zhang W, Xiao C. An improved saliency detection algorithm based on edge boxes and bayesian model. Traitement du Signal. 2022;39:59-70
  38. 38. Mignotte M. Saliency map estimation using an unsupervised pairwise pixel-based MRF model. MDPI (Multidisciplinary Digital Publishing Institute), Mathematics (-2156922), Section: Probability and Statistics, Special Issue: Bayesian Inference Modeling Applications (open access). 2023;11(4):986
  39. 39. Junwei H, Ngi NK, Mingjing L, HongJiang Z. Unsupervised extraction of visual attention objects in color images. IEEE Transactions on Circuits and Systems for Video Technology. 2006;16(1):141-145
  40. 40. da Fontoura Costa L. Visual Saliency and Attention as Random Walks on Complex Networks. arXiv. physics/0603025. 2007
  41. 41. Gopalakrishnan V, Hu Y, Rajan D. Random walks on graphs for salient object detection in images. IEEE Transactions on Image Processing. 2010;19:3232-3242
  42. 42. Wang W, Wang Y, Huang Q, Gao W. Measuring visual saliency by site entropy rate. In: The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010. 2010. pp. 2368–2375
  43. 43. Jiang B, Zhang L, Lu H, Yang C, Yang M-H. Saliency detection via absorbing markov chain. In: The IEEE International Conference on Computer Vision (ICCV), December 2013
  44. 44. Yuan Y, Li C, Kim J, Cai W, Feng DD. Reversion correction and regularized random walk ranking for saliency detection. IEEE Transactions on Image Processing. 2018;27(3):1311-1322
  45. 45. Zhang L, Ai J, Jiang B, Lu H, Li X. Saliency detection via absorbing markov chain with learnt transition probability. IEEE Transactions on Image Processing. 2018;27(2):987-998
  46. 46. Tang W, Wang Z, Zhai J, Yang Z. Salient object detection via two-stage absorbing markov chain based on background and foreground. Journal of Visual Communication and Image Representation. 2019;71:102727. DOI: 10.1016/j.jvcir.2019.102727
  47. 47. Xia C, Li X, Zhao L. Infrared small target detection via modified random walks. Remote Sensing. 2018;10(12). Available from: https://www.mdpi.com/2072-4292/10/12/2004
  48. 48. Singh V, Kumar N. Cobra: Convex hull based random walks for salient object detection. Multimedia Tools and Applications. 2022;81(21):30283-30303
  49. 49. Jiang F, Kong B, Li J, Dashtipour K, Gogate M. Robust visual saliency optimization based on bidirectional Markov chains. Cognitive Computation. 2021;13(1):69-80
  50. 50. Pengfei L, Xiaosheng Y, Jianning C, Chengdong W. Saliency detection via absorbing markov chain with multi-level cues. In: IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. advpub, p. 2021EAL2071. 2021
  51. 51. Wu J, Han G, Liu P, Yang H, Luo H, Li Q. Saliency detection with bilateral absorbing markov chain guided by depth information. Sensors. 2021;21(3)
  52. 52. Mignotte M. MDS-based multiresolution nonlinear dimensionality reduction model for color image segmentation. IEEE Transactions on Neural Networks. 2011;22(3):447-460
  53. 53. Mignotte M. MDS-based segmentation model for the fusion of contour and texture cues in natural images. Computer Vision and Image Understanding. 2012;116(9):981-990
  54. 54. Mignotte M. A multiresolution markovian fusion model for the color visualization of hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing. 2010;48(12):4236-4247
  55. 55. Mignotte M. A bi-criteria optimization approach based dimensionality reduction model for the color display of hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing. 2012;50(2):501-513
  56. 56. Moevus A, Mignotte M, de Guise J, Meunier J. A perceptual map for gait symmetry quantification and pathology detection. BioMedical Engineering OnLine (BMEO). 2015;14(1):99. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4659413/
  57. 57. Touati R, Mignotte M. MDS-based multi-axial dimensionality reduction model for human action recognition. In: Eleventh conference on Computer and Robot Vision, CRV’2014, Montréal, Quebec, Canada. May 2014. pp. 262–267
  58. 58. Mignotte M. A label field fusion bayesian model and its penalized maximum rand estimator for image segmentation. IEEE Transactions on Image Processing. 2010;19(6):1610-1624
  59. 59. Mignotte M. An energy based model for the image edge histogram specification problem. IEEE Transactions on Image Processing. 2012;21(1):379-386
  60. 60. Mignotte M. Non-local pairwise energy based model for the HDR image compression problem. Journal of Electronic Imaging. 2012;21(1):013016
  61. 61. Touati R, Mignotte M, Dahmane M. Multimodal change detection in remote sensing images using an unsupervised pixel pairwise-based markov random field model. IEEE Transactions on Image Processing. 2019;29(1):757-767
  62. 62. Khlif A, Mignotte M. Segmentation data visualizing and clustering. Multimedia Tools and Applications. 2016;76(1):1-22
  63. 63. Torgerson W. Multidimensional scaling: I. theory and method. Psychometrika. 1952;17:401-419
  64. 64. Cox T, Cox M. Multidimensional Scaling. London: Chapman & Hall; 1994
  65. 65. Faloutsos C, Lin K-I. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, California. June 1995, pp. 163–174
  66. 66. Nyström EJ. Über die praktische auflösung von integralgleichungen mit anwendungen auf randwertaufgaben. Acta Mathematica. 1930;54(1):185-204
  67. 67. Platt JC. Fastmap, metricmap, and landmark mds are all nystrom algorithms. In: Proceedings of 10th International Workshop on Artificial Intelligence and Statistics (AISTATS). 2005. pp. 261–268
  68. 68. Borji A, Cheng M, Jiang H, Li J. Salient object detection: A benchmark. IEEE Transactions on Image Processing. 2015;24(12):5706-5722
  69. 69. Wang J, Jiang H, Yuan Z, Cheng M-M, Hu X, Zheng N. Salient object detection: A discriminative regional feature integration approach. International Journal of Computer Vision. 2017;123(2):251-268
  70. 70. Liu G, Yang J. Exploiting color volume and color difference for salient region detection. IEEE Transactions on Image Processing. 2019;28(1):6-16
  71. 71. Felzenszwalb P, Huttenlocher D. Efficient graph-based image segmentation. International Journal of Computer Vision. 2004;59(2):167-181
  72. 72. Mignotte M. Segmentation by fusion of histogram-based k-means clusters in different color spaces. IEEE Transactions on Image Processing. 2008;17(5):780-787
  73. 73. Mignotte M. A label field fusion model with a variation of information estimator for image segmentation. Information Fusion. 2014;20:7-20
  74. 74. Banks S. Signal Processing, Image Processing and Pattern Recognition. Upper Saddle River, NJ: Pearson Prentice Hall; 1990
  75. 75. Martinkauppi JB, Soriano MN, Laaksonen MH. Behavior of skin color under varying illumination seen by different cameras at different color spaces. In: Proc. SPIE, Machine Vision Applications in Industrial Inspection IX, San Jose California. January 2001. pp. 102–113
  76. 76. Braquelaire J-P, Brun L. Comparison and optimization of methods of color image quantization. IEEE Transactions on Image Processing. 1997;6:1048-1952
  77. 77. Stokman H, Gevers T. Selection and fusion of color models for image feature detection. IEEE Transactions on Image Processing. 2007;29:371-381
  78. 78. Kato Z. A Markov Random Field image segmentation model for color textured images. Image and Vision Computing. 2006;24(10):1103-1114
  79. 79. Perez P, Hue C, Vermaak J, Gangnet M. Color-based probabilistic tracking. In: Eur. Conf. on Computer Vision, ECCV’2002, LNCS 2350, Copenhaguen, Denmark. June 2002. pp. 661–675
  80. 80. Shi J, Yan Q, Xu L, Jia J. Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016;38(4):717-729
  81. 81. Reina G, Ertl T. Implementing fastmap on the GPU: Considerations on general-purpose computation on graphics hardware. Eurographics UK Theory and Practice of Computer Graphics. 2005;01:51-58

Notes

  • The basic idea of the Nyström method can be seen as choosing a subset of samples to obtain a rough non-linear mapping (or embedding) and then extending the solution obtained to the complete set of remaining samples (using a technique similar to extrapolation from the original data).
  • The first step of the FastMap procedure consists in selecting the two most different feature vectors (or objects) to build the projection line. These two feature vectors are estimated by using a deterministic algorithm named choose-distant-objects [65]. The second step consists in projecting every other object onto this axis (named a pivot line) by using the cosine law (see Algorithm 1). The FastMap C++ code is openly accessible online.
  • For example, RGB is an additive color system based on tri-chromatic theory and non-linear with visual perception. HSV is noteworthy because it allows chromatic information to be decoupled from shading effects. YIQ color channels have the interesting property of separately encoding chrominance and luminance information (which can be interesting in the field of compression); in addition, this color space is intended to take advantage of human color characteristics. XYZ has the benefit of being more linear from a psycho-visual point of view, although its components are non-linear when it comes to mixing colors of linear components. LAB color is designed to approximate human vision; it aspires to perceptual uniformity, and its L component closely matches human perception of lightness. Finally, the LUV components give a Euclidean color space and provide a perceptually uniform spacing of color quite close to a Riemannian space.
  • The major reason for over-weighting precision over recall is that recall rate is an error metric that is not as important as precision [23, 68] (since 100% recall can be easily achieved by labeling the entire image as a saliency region).
