Performance comparison of standard HEVC, JPEG, JPEG2000, and the two proposed CGC+SF model based approaches in terms of coded rate (bpp), PSNR, and the absolute RMS contrast error.
Perceptual coding is a subdiscipline of image and video coding that uses models of human visual perception to achieve improved compression efficiency. Nearly all image and video coders have included some perceptual coding strategies, most notably visual masking. Today, modern coders capitalize on various basic forms of masking, such as the fact that distortion is harder to see in very dark and very bright regions, in regions with higher frequency content, and in temporal regions with abrupt changes. However, beyond these obvious forms of masking, there are many other masking phenomena that occur (and co-occur) when viewing natural imagery. In this chapter, we present our latest research in perceptual image coding using natural-scene masking models. We specifically discuss: (1) how to predict local distortion visibility using improved natural-scene masking models and (2) how to apply the models to high efficiency video coding (HEVC). As we will demonstrate, these techniques can offer 10–20% fewer bits than baseline HEVC in the ultra-high-quality regime.
- visual masking
- contrast gain control
- adaptive quantization
Recent advancements in digital signal processing technologies have made available a wide variety of digital media for end use by consumers and practitioners. It is estimated that more than 100 billion digital photos and videos are recorded, transmitted, and viewed annually just in the United States. Today, the tremendous popularity of ubiquitously connected digital imaging devices has made the Internet the standard means by which to share imagery. Of course, digital images/videos have many uses beyond entertainment, including online education, video conferencing, remote medical diagnoses, and many others. Such widespread use of digital images and videos places a great demand on compression algorithms which are absolutely crucial for reducing the bandwidth requirements of storing and transmitting these images and videos.
To this end, state-of-the-art image/video compression algorithms exploit the fact that the human visual system (HVS) is an imperfect sensor. When a digital image/video is to be viewed by a human, an exact bit-for-bit reconstruction is unnecessary; rather, the data can be coded in a non-invertible or lossy fashion. Lossy compression is useful for applications where lower information fidelity can be tolerated, such as in consumer photography, computer vision, and machine vision applications. If the compression distortions are invisible, the compression is said to be visually lossless. Visually lossless compression techniques generally take advantage of a low-level psychophysical phenomenon such as visual masking. If, on the other hand, the compression distortions are visible, the compression is called visually lossy. Visually lossy compression techniques aim to generate the best-looking reconstructed version under the given bit-rate constraints. Both of these paradigms fall under the more general category of the so-called perceptual coding, owing to the need to model the human visual system (HVS), and in particular, how the HVS detects and perceives compression-induced distortions.
With the release of each new coding standard, the emphasis in perceptual coding research has largely shifted from the mid-quality regime toward the ultra-high-quality regime, with the aim of producing compressed images and videos which are visually equivalent to the originals. Thus, research in visually lossless compression has seen a recent resurgence in importance. In this chapter, we focus exclusively on visually lossless image compression. The key challenge in visually lossless compression is to automatically determine, on a per-image basis, the maximum amount of compression that can be applied before the resulting image appears distorted. However, tackling this challenge requires the ability to accurately and efficiently predict the visibility of local distortions in an image, a task which remains elusive in current research.
Perceptual coding strategies have long relied on well-known properties of the HVS largely derived from the visual psychophysics literature (e.g., see [1, 2]). Perhaps, the most well-known and widely used property is the contrast sensitivity function (CSF), which specifies the visibility of a narrowband spatial pattern (the target of detection) as a function of the pattern’s spatial or temporal frequency. Previous psychophysical studies have shown that the minimum contrast needed to detect a visual target (e.g., distortions) varies with both the spatial frequency and the temporal frequency of the target. This minimum contrast is called the contrast threshold, and the inverse of this threshold is called contrast sensitivity. For targets consisting of spatial sine waves, the CSF is band-pass, indicating that we are least sensitive to very low-frequency and very high-frequency targets. The temporal CSF is an extension of the spatial CSF which takes into account sensitivity to time-varying targets, typically demonstrating a peak in sensitivity around 4–8 Hz.
The CSF can be thought of as a baseline visual sensitivity measure because the CSF is traditionally measured for targets shown against a blank background. However, for targets consisting of compression distortions, this blank-background scenario occurs only when the distortions happen to appear in very smooth regions such as the sky. In other image regions, such as structures, textures, and hybrid regions, the distortions are generally more difficult to detect (i.e., they exhibit higher contrast detection thresholds), and therefore, visual sensitivity to the distortions is said to be reduced in these regions. This concept of visual masking has served as the cornerstone of modern perceptual coding.
At the most general level, visual masking refers to a reduction or elimination in the visibility of one signal (called the “target”) caused by the presence of another signal (called the “mask”). For image compression, the image serves as the mask, and the compression distortions serve as the targets of detection. There are various forms of visual masking which can occur and co-occur in images and video. For example, it is well-known that humans have a harder time seeing distortions in very bright regions of an image, an HVS property called luminance masking. To capitalize on this fact, modern coding schemes more coarsely quantize the coefficients corresponding to (devote fewer bits to) locations of higher luminance. A similar strategy can be used for very busy regions of an image (contrast masking) or during scene changes in video (temporal masking).
These low-level aspects of the HVS are so commonly used in image/video coding for two simple reasons: (1) they are easy to incorporate and (2) such low-level aspects have been well-documented in the visual psychology literature with accompanying computational models. However, most existing models of masking (and thus, existing perceptual coding techniques) are largely based on findings using artificial stimuli rather than on a true database of natural scenes. The advantage of these artificial masks is that they have well-defined features and parameters, which allows one to investigate the effects of specific mask properties on the detection thresholds. However, in image compression, the mask is necessarily an image, and thus, it remains unclear whether the results obtained using artificial masks can be used to predict the results obtained using natural scene masks. There are some studies using natural scenes as masks, but these studies either employed only a limited number of tested images, or the thresholds were limited to select spatial locations within images (e.g., [3–5]).
In this chapter, we present our latest research in visually lossless image compression, which operates on masking maps predicted by a natural-scene masking model built upon a large local masking database. Specifically, we recently published the results of a large-scale psychophysical study designed to obtain local contrast detection thresholds (masking maps) for a database of natural images. This database can serve as crucial ground-truth data for investigating how local image content affects visual masking thresholds. Using this database, we present a high efficiency video coding (HEVC)-based quantization scheme which uses the contrast gain control (CGC) with structure facilitation model, trained on the database of local masking thresholds, to predict a masking map for the to-be-compressed image. The masking map is then used to guide a spatially adaptive quantization scheme, which more coarsely quantizes the blocks that can induce greater masking, and vice versa. Using this approach, our technique can generate compressed images in which the contrasts of the local compression artifacts are much closer to their masked visibility thresholds than when using standard HEVC.
This chapter is organized as follows: Section 2 provides a brief review of current visually lossless perceptual image compression algorithms. In Section 3, we describe the computational models used to predict the masking map for any given input image. In Section 4, we describe how to incorporate the masking map to perform spatially adaptive compression using HEVC. In Section 5, we analyze and discuss the performance of the proposed visually lossless compression method. General conclusions are presented in Section 6.
2. Previous work on perceptual image compression
As we mentioned, the goal of visually lossless image compression is to generate images containing distortions at or just below the visual detection threshold. To this end, previous work in this area has exploited properties of the HVS (most notably the CSF and visual masking) and has taken a variety of approaches toward incorporating these visual properties into the transform, quantization, and/or encoding stages. In this section, we briefly review previous work on perceptual (HVS-based) image compression.
Perceptual image compression techniques date back to the early 1990s, when Safranek et al. published one of the earliest attempts at incorporating HVS properties into compression through a system called the perceptually tuned subband image coder (PIC). Three properties of low-level vision were modeled in PIC: (1) contrast sensitivity, (2) luminance masking, and (3) contrast masking. These properties were used to guide the selection of per-subband quantization step sizes designed to yield visually lossless results. Although PIC was initially designed for visually lossless compression, Pappas et al. reported that this system can also be used for visually lossy compression, and that high performance can be achieved when the perceptual thresholds are properly scaled. Also, Hontsch et al. extended PIC by exploiting visual masking; they proposed a locally adaptive perceptual coder which discriminates between image components based on their perceptual relevance.
Later research on compression has exploited the properties of the HVS and employed the CSF to regulate the quantization step size in order to minimize the visibility of compression artifacts. For example, Nadenau et al. incorporated HVS properties into a wavelet-based coding algorithm via a noise-shaping filtering stage which preceded quantization. Albanesi proposed a method for incorporating HVS characteristics directly into the transform stage of a wavelet-based coder via the design of analysis and synthesis filters based on the CSF. Antonini et al. introduced a wavelet coder which employed a CSF-weighted distortion criterion during bit allocation. O’Rourke et al. proposed a wavelet-based image compression technique based on two properties of the HVS: orientation sensitivity and contrast sensitivity. Specifically, the diamond-shaped frequency passband of the HVS was exploited for the design of the compression scheme, and the logarithm of the contrast sensitivity was employed for bit allocation. Lai et al. presented an image compression scheme in which contrast-sensitivity and visual masking adjustments were performed within a wavelet-based coder using a low-pass model of the CSF and a local measure of visual distortion. In two similar approaches, Beegan et al. used a “CSF mask” to adjust transform coefficients prior to quantization, and Wei et al. used a “visual compander.” Also, Zhang et al. proposed luminance and chrominance CSF-based weighting in the discrete-wavelet-packet-transform domain to reduce perceptible information of high-dynamic-range images.
Other researchers have conducted psychophysical experiments to measure visibility thresholds for compression artifacts in unnatural images and/or natural scenes. For example, Watson et al. measured visual detection thresholds for both individual wavelet basis functions and simulated wavelet subband quantization distortions presented against a gray background. The thresholds were modeled as a function of the spatial frequency of the distortions, and the model was then used to compute quantizer step sizes for each wavelet subband. Watson’s approach was later extended to lower rate coding via models of visual masking and summation. Nadenau et al. measured the visibility thresholds of quantization noise in natural scenes and compared five visual masking models for predicting these thresholds. They concluded that a masking model considering the local activity of the wavelet subbands performed better than point-wise contrast masking models.
In a recent study, Chandler et al. proposed a new kind of masking called structural masking by psychophysically measuring the visibility thresholds of wavelet distortions placed on small patches categorized into three groups: texture, structure, and edges. The authors also proposed a different set of contrast gain control model parameter values for each of the three categories and showed that the category-specific masking model yielded better compression results for wavelet-type compression schemes. Similarly, Chandler et al. proposed a visually lossless compression algorithm based on psychophysical detection experiments with wavelet distortions on radiograph images.
Several other studies have specifically focused on visually lossless compression with JPEG and JPEG2000. For example, Oh et al. developed a visually lossless compression model which allocates the code streams of the JPEG2000 encoder by measuring visibility thresholds via a wavelet statistics-based quantization distortion model and a visual masking model. Ponomarenko et al. pointed out that the visual quality of the input (to-be-compressed) image has a large effect on compression performance. Thus, they adaptively adjusted the scaling factor of the JPEG quantization matrix based on the estimated blur and noise content of the input image and showed that such a compression scheme gives a larger compression ratio than the super-high-quality mode of consumer digital cameras. Leung et al. proposed a JPEG2000-based visually lossless compression scheme for CT images in which the visibility thresholds varied according to the viewing window/display size of the CT image.
3. Computational models of local masking
This section describes the computational masking models that we developed to predict the masking map for a given input (to-be-compressed) image. First, we describe the ground-truth database used to train the models. Next, we describe a modified version of the model put forth by Watson and Solomon, which operates by simulating V1 neural responses with contrast gain control (CGC). Here, we have modified the model and optimized its parameters to provide the best predictions for the aforementioned database. In addition, we describe an extension of the model to deal with structural facilitation, which we reported in earlier work. Structural facilitation refers to the reduction in threshold (increased distortion visibility) in parts of the image containing highly recognizable structure.
3.1. Database of local masking in natural scenes
In earlier work, we performed a large-scale psychophysical experiment in which we measured thresholds for detecting simulated distortions placed within each 85 × 85 block of every image from the CSIQ database. The simulated distortion was a narrowband log-Gabor noise target whose center frequency was chosen to be near the peak of visual sensitivity (3.6 cycles/degree of visual angle). The thresholds were obtained using a three-alternative forced-choice procedure; we employed at least three subjects per image, with at least two trials per subject. The end result of the experiment was a masking map for each of the 30 CSIQ images; each entry in each map denotes the minimum contrast required for a human subject to detect distortions at that location in the image.
Figure 1 shows the masking maps from the database. Each map consists of 36 values corresponding to the 36 blocks of the associated image. Brighter map values denote higher thresholds (i.e., more masking); darker map values denote lower thresholds (less masking). The first and seventh rows of Figure 1 show the 30 mask images. Below the mask images, the first, second, and third images show the average maps of the two trials of Subject 1, Subject 2, and Subject 3. The remaining rows show the average maps (taken across all six trials; 2 × 3 subjects) and the corresponding maps of the standard deviations of each average. Note that the averages and standard deviations are on different scales; please refer to the respective color bars shown in Figure 1. Overall, the subjects were in high agreement with each other and with themselves across separate trials.
In the following subsection, we describe the contrast gain control with structure facilitation model which operates by simulating V1 neural responses to predict these masking maps.
3.2. Contrast gain control with structure facilitation (CGC+SF) model
Contrast masking has been widely used for predicting distortion visibility in images and videos [28, 42–44]. Among the many existing models of contrast masking, those which simulate the contrast gain-control response properties of V1 neurons are the most widely used. Although several contrast gain control (CGC) models have been proposed in previous studies (e.g., Refs. [20, 27, 30, 31, 41]), in most cases, the model parameters were selected based on results obtained using either unnatural masks or only a very limited number of natural images. Thus, in this chapter, we describe two approaches to improve the current CGC model: (1) the CGC model parameters are optimized by training on the large dataset of local masking in natural scenes; and (2) the CGC model is augmented with a structural facilitation (SF) model which better captures the reduced masking observed in structured regions.
3.2.1. Watson-Solomon contrast gain control (CGC) model
The Watson and Solomon model  is a model of V1 simple-cell responses that includes CGC from neighboring neurons. Figure 2 shows a block diagram of the model. The model takes two images as input: (1) the mask image (original image), and (2) the mask+target image (distorted image). Both of these images are then subjected to the following stages:
1. A spatial filter designed to mimic the human contrast sensitivity function (CSF).
2. A local spatial-frequency decomposition designed to mimic the initially linear response properties of individual V1 neurons.
3. Excitatory and inhibitory nonlinearities designed to mimic the nonlinear response properties of individual V1 neurons.
4. Divisive inhibition designed to mimic the interactions among groups of V1 neurons.
Steps 1 and 2: For Step 1, we use the CSF filter specified in [32, 33]. For Step 2, we use a log-Gabor filterbank consisting of six scales and six orientations. The center radial frequencies of the filters are 0.3, 0.61, 1.35, 3.22, 7.83, 16.1 c/deg, each with a radial-frequency bandwidth of 2.75 octaves. The center orientations of the filters are , each with an orientation bandwidth of .
Steps 3 and 4: Let denote the output of the log-Gabor filter with a center of radial frequency an orientation , and at the spatial location . This filter output represents the initially linear response of the neuron. To obtain the nonlinear neural response, , we perform Steps 3 and 4 via the following equation:
Here, is an output gain factor (we use ). The parameters and are the excitatory and inhibitory exponents which impose the nonlinearities (we use and ). The parameter is a constant designed to prevent division by zero (we use ). The division simulates inhibition from neighboring neurons; these neurons constitute the so-called inhibitory pool, and they are neighbors in space, radial frequency, and orientation. In Eq. (1), the inhibitory pool is represented by the set of spatial and spatial frequency coordinates . The neighbors come from a 3 × 3 surround in space, a ±0.7 octave bandwidth surround in radial frequency, and a bandwidth surround in orientation.
All of the abovementioned parameters (, , , , and ) were chosen via a brute-force search to provide the best overall fit to the thresholds from our database, under the condition that the parameters remain within biologically plausible ranges . The radial frequency bandwidth and center radial frequencies were chosen in this way as well. The other parameters of the model were either set as specified in  or were chosen based on our prior related modeling efforts .
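For readers who want the gain-control computation of Steps 3 and 4 in compact form, the divisive inhibition described above can be sketched as follows. This is written in generic notation and is not a verbatim copy of Eq. (1): the symbol names are labels assumed here for illustration ($z$ for the linear filter output, $r$ for the nonlinear response, $g$ for the output gain, $p$ and $q$ for the excitatory and inhibitory exponents, $b$ for the constant that prevents division by zero, and $\mathcal{N}$ for the inhibitory pool), and the form follows the usual Watson and Solomon style of divisive normalization.

```latex
r(x_0, f_0, \theta_0) \;=\; g \cdot
\frac{ z(x_0, f_0, \theta_0)^{\,p} }
     { b^{\,q} \;+\; \sum_{(x, f, \theta) \in \mathcal{N}} z(x, f, \theta)^{\,q} }
```

In this form, a stronger response from the neighbors in the pool enlarges the denominator and therefore suppresses the response of the neuron at $(x_0, f_0, \theta_0)$, which is the mechanism by which busy image content raises the detection threshold.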
Comparing the responses: Step 4 results in two collections of responses: one collection of responses to the mask, and another collection of responses to the mask+target. The target is deemed visible if these collections of responses are sufficiently different from each other, thus indicating a visible difference between the two stimuli (i.e., that the distortions are visible). To determine whether this condition is met, the collections of responses are subtracted from each other, then collapsed via a Minkowski sum, and then this scalar difference () is compared to a pre-defined “at-threshold” difference value (). We used a Minkowski exponent of 2.0 to collapse across space, and an exponent of 1.5 to collapse across radial frequency and orientation. The contrast of the target is iteratively adjusted until . When this condition is met, the contrast of the target is deemed to be the at-threshold contrast (i.e., the contrast detection threshold).
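The pooling and threshold search just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function names, the order of pooling (space first, then frequency and orientation), the bisection search, and the `respond` callable that stands in for the full model are all hypothetical choices made here.

```python
import numpy as np

def minkowski_pool(diff, beta_space=2.0, beta_channel=1.5):
    """Collapse a response-difference array to one scalar.

    diff has shape (scales, orientations, rows, cols). Space is pooled
    first with exponent beta_space, then the per-channel results are
    pooled across scale and orientation with exponent beta_channel.
    The pooling order is an assumption of this sketch.
    """
    mag = np.abs(diff)
    per_channel = np.sum(mag ** beta_space, axis=(2, 3)) ** (1.0 / beta_space)
    return float(np.sum(per_channel ** beta_channel) ** (1.0 / beta_channel))

def find_threshold_contrast(respond, mask, target, d_threshold,
                            lo=1e-4, hi=1.0, iters=40):
    """Bisection search (an assumed strategy) for the target contrast at
    which the pooled response difference reaches d_threshold.

    respond(image) is a hypothetical callable returning model responses
    with shape (scales, orientations, rows, cols). The search assumes the
    pooled difference grows monotonically with target contrast.
    """
    r_mask = respond(mask)
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint of the contrast range
        d = minkowski_pool(respond(mask + mid * target) - r_mask)
        if d < d_threshold:
            lo = mid   # target still invisible: raise the contrast
        else:
            hi = mid   # target visible: lower the contrast
    return float(np.sqrt(lo * hi))
```

Any monotone search would serve equally well here; bisection is used only because it needs no derivative of the model response.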
We refer interested readers to  for more specific details of the database and model.
3.2.2. Structure facilitation (SF) model
Using the optimized parameters described in the previous subsection, our implementation of the Watson and Solomon CGC model is quite accurate in predicting detection thresholds. On our database, the model achieves a Pearson correlation coefficient (PCC) of 0.83 between the ground-truth and predicted thresholds. Generally, the model works best on regions containing textures and worst on regions containing more complex structure. In particular, the model tends to overestimate thresholds for regions containing recognizable structure. This tendency is demonstrated in Figure 3, which shows the ground-truth and predicted thresholds for two images; observe that the model predicts the thresholds to be higher than ground-truth near the top of the gecko’s body and in the child’s face.
As we mentioned in , recognizable structures within the local regions of natural scenes facilitate (rather than mask) the distortion visibility. Thus, to model this “structure facilitation,” we employ an inhibition modulation factor () in the gain control equation:
where we adjust depending on the strength of structure within an image. Although the specific amount of inhibition modulation remains an open area of research, we have found the following sigmoidal relationship between and estimated structure strength to be quite effective (shown in Figure 4):
Observe that the inhibition modulation is applied in a block-based fashion. Here, denotes the inhibition modulation factor for the ith block of size .
The variable in Eq. (3) is a map which denotes the local structure strength (described next), and is a block of corresponding to the ith block of the image. The inhibition modulation for each block is further adjusted based on the 80% largest values of S, denoted by the variable . Furthermore, if the largest value of S is small, or if the kurtosis of S is small, then either there is insufficient structure (e.g., the image is mostly textured or smooth), or the structure is not locally concentrated. In this case, no inhibition modulation is applied (i.e., , for all blocks) (Figure 4).
The structure map of an image is generated via the following equation which uses different feature maps:
Here, , , and denote maps of local luminance, local sharpness , and local first-order Shannon entropy, respectively. The values and denote, respectively, maps of the average and the standard deviation of fractal texture features  computed for each local region. All features were computed for 32 × 32 blocks with 50% overlap between neighboring blocks. Each feature map was then normalized to the range [0, 1] and then resized to match the input image’s dimensions. Figure 5 shows some examples.
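To make the block-based feature computation concrete, the following Python sketch computes one of the feature maps named above, local first-order Shannon entropy, on 32 × 32 blocks with 50% overlap, normalizes it to [0, 1], and resizes it to the image dimensions. It is an illustration only: the function names are hypothetical, an 8-bit grayscale input is assumed, and nearest-neighbor resizing is an assumed choice because the interpolation method is not stated in the text.

```python
import numpy as np

def block_entropy_map(image, block=32, overlap=0.5, bins=256):
    """First-order Shannon entropy (in bits) of each overlapping block.

    Assumes an 8-bit grayscale image with values in [0, 255].
    """
    step = max(1, int(block * (1.0 - overlap)))
    rows = range(0, image.shape[0] - block + 1, step)
    cols = range(0, image.shape[1] - block + 1, step)
    out = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            patch = image[r:r + block, c:c + block]
            hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
            p = hist[hist > 0] / patch.size  # empirical gray-level probabilities
            out[i, j] = -np.sum(p * np.log2(p))
    return out

def normalize01(feature):
    """Rescale a feature map to [0, 1]; a constant map becomes all zeros."""
    span = feature.max() - feature.min()
    if span == 0:
        return np.zeros_like(feature, dtype=float)
    return (feature - feature.min()) / span

def resize_nearest(feature, shape):
    """Nearest-neighbor resize to the given (rows, cols) shape."""
    r = (np.arange(shape[0]) * feature.shape[0] / shape[0]).astype(int)
    c = (np.arange(shape[1]) * feature.shape[1] / shape[1]).astype(int)
    return feature[np.ix_(r, c)]
```

The other feature maps (luminance, sharpness, and the fractal texture statistics) would follow the same block loop with a different per-patch measure.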
The prediction performance of the Watson and Solomon CGC model can be greatly improved when the structure facilitation is taken into account [as specified in Eq. (2)]. As demonstrated in Figure 6, the proposed SF model was able to improve the CGC model’s prediction performance in local image regions that contain recognizable structures, while not adversely affecting the prediction results of the others. For example, near the top of the gecko’s body and in the child’s face, the contrast detection thresholds predicted using the combined CGC+SF model match the ground-truth thresholds better than using the CGC model. Furthermore, the Pearson correlation coefficients between the CGC+SF model predictions and ground-truth thresholds also improved as compared to using the CGC model alone.
4. Application of the masking model to compression
The masking model described in the previous section provides a way of predicting a masking map for any given input image. In this section, we show how to use this masking map to achieve visually lossless compression. In particular, we describe two different ways of incorporating the masking maps into an HEVC image coder: (1) by adjusting the QP values in HEVC on a per-block basis; and (2) by pre-adjusting the image’s pixel values prior to HEVC compression, and post-adjusting the pixel values of the decompressed image following HEVC decompression.
Similar to H.264/AVC, HEVC employs a uniform reconstruction quantizer for the transform coefficients. It is the quantization stage that introduces distortions; thus, generating visually lossless results requires direct or indirect modification of the quantization step sizes (Qstep values) or quantization parameters (QP values). Previous efforts toward improved quantization have aimed at achieving higher PSNR values (e.g., [35, 36]) or better scores on other visual quality measures (e.g., [37, 38]). However, for visually lossless compression, we argue that the use of masking maps is a better and more logical alternative.
Our approach assumes that each local area within an image should have its own QP based on the amount of masking induced in that region. Note that the larger the QP value is, the greater the contrast of the distortions. Therefore, the first step of our method is to predict a map consisting of block-based QP values, such that the resulting distortions in each corresponding block exhibit a contrast at the contrast threshold. Furthermore, as we mention later in Section 5, because the predicted thresholds are underestimates of the thresholds for normal viewing conditions (as opposed to the highly controlled viewing conditions used in the psychophysical experiment), we aim for the QP values required to generate distortion contrasts slightly greater than the predicted thresholds (greater by at most 10 dB).
4.1. Local QP estimation from the masking map
Consider the QP value used for the ith block and the contrast of the resulting distortions in that block. Our objective is to employ a QP for the ith block such that the distortion contrast for that block equals the contrast threshold for the ith block. That is, we seek the QP value for each block required to place the block’s distortions at the threshold of visibility.
The primary difficulty in determining the relationship between QP and distortion contrast is that the relationship changes depending on the patch. In our previous work, we used a regression model to predict this relationship on a per-block basis using statistical properties of each block as regressors. Although that approach was extremely fast, it suffered from a significant number of mispredicted QP values and thus induced distortions with incorrect contrasts. Here, we present a much more accurate solution based on the use of a pre-compression lookup table.
Specifically, prior to using HEVC, we perform the following steps:
STEP 1. Divide the image into 32 × 32 blocks (the maximum transform block size in HEVC).
STEP 2. Compute the 2D DCT of each block.
STEP 3. Iterate over a QP range from 1 to 51…
- Quantize the block using the corresponding Qstep value, given by Qstep = 2^((QP - 4)/6).
- Perform an inverse 2D DCT of each block.
- Measure and record the contrast of the resulting distortions.
In this way, for each block, we record a table that can be used to look up the QP value whose distortion contrast is closest to the block’s contrast threshold. Figure 7 shows the lookup table values in the form of plots relating QP and distortion contrast for eight different image blocks. Generating the lookup table requires only a small fraction of the total time required to encode the image because only a series of inverse 2D DCTs and contrast measurements is required. Most importantly, this technique provides extremely accurate selection of the QP values.
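The lookup-table steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the implementation behind the reported results: it uses a floating-point orthonormal DCT instead of the integer transforms of an actual HEVC encoder, the contrast measure (RMS of the error divided by the mean block luminance) is an assumed definition, and all function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def rms_contrast(distortion, block):
    """ASSUMED contrast measure: RMS error over mean block luminance."""
    return float(np.sqrt(np.mean(distortion ** 2)) / max(block.mean(), 1e-6))

def qp_lookup_table(block, qps=range(1, 52)):
    """Map each QP to the contrast of the distortion it induces in a square block."""
    d = dct_matrix(block.shape[0])
    coeffs = d @ block @ d.T                     # forward 2D DCT
    table = {}
    for qp in qps:
        qstep = 2.0 ** ((qp - 4) / 6.0)          # Qstep = 2^((QP - 4)/6)
        quantized = np.round(coeffs / qstep) * qstep
        recon = d.T @ quantized @ d              # inverse 2D DCT
        table[qp] = rms_contrast(recon - block, block)
    return table

def select_qp(table, threshold):
    """Return the QP whose distortion contrast is closest to the threshold."""
    return min(table, key=lambda qp: abs(table[qp] - threshold))
```

Given a predicted threshold for a block, `select_qp(qp_lookup_table(block), threshold)` returns that block's entry in the QP map. The forward DCT is computed once per block, so the loop costs only one quantization and one inverse transform per QP value.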
4.2. Spatially adaptive quantization using the QP map
Given the QP map, we present two approaches to implement the compression. The first approach, which is the more direct approach, assigns a different QP value to each 64 × 64 block. This approach was implemented by modifying the reference HEVC profile to explicitly use a separate QP value for each 64 × 64 coding unit. This approach is straightforward to implement, but it lacks some flexibility.
The other approach, which can be used with any lossy compression algorithm, effects the spatially adaptive quantization using pre-processing and post-processing stages. Let and denote the two image pixels and their corresponding quantization step sizes are denoted by and , respectively. The quantized values of the two pixels (denoted by and ) are then given by
where is a scaling factor; is a factor that normalizes the scaled pixel value (e.g., ) into [0, 255]. Eqs. (5) and (6) indicate that different local image areas can have different quantization parameters even though the whole image is quantized using one uniform , as long as different image pixels are scaled properly.
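The principle behind this second approach can be checked numerically with a short Python sketch. It does not reproduce Eqs. (5) and (6): the normalization into [0, 255] is omitted, and the function names, pixel values, step size, and scaling factor are hypothetical values chosen for illustration. It shows only that scaling pixels before a uniform quantizer, and undoing the scaling afterward, is equivalent to quantizing the original pixels with a step size divided by the scaling factor.

```python
import numpy as np

def quantize(values, step):
    """Uniform quantization followed by reconstruction."""
    return np.round(values / step) * step

def scaled_quantize(values, scale, global_step):
    """Pre-scale, quantize with one global step, then undo the scaling."""
    return quantize(values * scale, global_step) / scale

pixels = np.array([10.0, 57.0, 123.0, 200.0])
global_step = 8.0
scale = 0.5    # hypothetical scaling factor for this image region

adaptive = scaled_quantize(pixels, scale, global_step)
direct = quantize(pixels, global_step / scale)   # effective step of 16

# Both routes give the same reconstruction for this region.
assert np.allclose(adaptive, direct)
```

A scaling factor below 1 therefore coarsens the effective quantization in that region, and a factor above 1 refines it, while the codec itself still sees a single global step size.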
For standard HEVC, the quantization step sizes relate to the QP values via Qstep = 2^((QP - 4)/6). However, in our second approach, because pixel values are quantized, we relate the quantization step to the QP value through
where t is a nonlinear coefficient which aims at increasing/decreasing the value range within a map; and are the ratio and offset parameters which adjust the quantization step size after the nonlinear transform. The block diagram of the second approach is shown in Figure 8. Specifically, in the pre-processing stage, the luma channel of an image is first multiplied by a scaling map (denoted by ) and then divided by to have a range of [0, 255]. The scaling map is given by
where denote the values for different local image areas; denotes the average value of [i.e., ];is given by
In this chapter, we set , . Thus, and can be written as
In the post-processing stage, an inverse scaling map (denoted by ) is applied to convert the scaled luminance to the original value:
In the standard HEVC stage, the global is computed by
where and are the linear coefficients which adjust the RMS contrast of the distortions in the compressed image to be near or below the threshold. We estimated their values by fitting the model to the 30 images in the CSIQ database, and thus, we set , .
Two problems can occur with this approach. First, the QP map may contain zero values, in which case the above equations are not valid. Second, the predicted block-based maps often contain abrupt changes in value at the patch edges, which can degrade the quality of the compressed images by producing ringing or blocking artifacts, especially at lower bit rates. To solve these two problems, we first set the local zero values to the minimum of all the other, nonzero values within the image and then applied a Gaussian filter to the modified maps. As we have observed, for most natural images, the image contrast should change smoothly, not abruptly, and consequently, the resulting maps should also be smooth. Figure 9 shows the image 1600 compressed using the QP map with and without Gaussian filtering. Observe that blocking artifacts occur in the compressed image (Figure 9a) if the original map is used; these artifacts disappear when the map is smoothed by a Gaussian filter (Figure 9b).
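The two fixes described above (replacing zeros, then smoothing) can be sketched as follows. The filter width is not stated in the text, so `sigma` is a free, assumed parameter. The block map is expanded to pixel resolution before filtering, which is also an assumption about how the map is applied. Function names are hypothetical.

```python
import numpy as np

def fill_zeros(qp_map):
    """Replace zero entries with the smallest nonzero value in the map."""
    out = np.asarray(qp_map, dtype=float).copy()
    nonzero = out[out > 0]
    if nonzero.size:
        out[out == 0] = nonzero.min()
    return out

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_smooth(m, sigma):
    """Separable Gaussian filtering with edge replication at the borders."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(m, dtype=float), r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def smooth_block_map(qp_map, block=64, sigma=16.0):
    """Fill zeros, expand each block value to block x block pixels, then smooth."""
    pixel_map = np.kron(fill_zeros(qp_map), np.ones((block, block)))
    return gaussian_smooth(pixel_map, sigma)
```

Because the kernel is normalized and the borders are replicated, the smoothed values stay within the range of the filled map, and only the transitions between neighboring blocks are softened.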
In the following section, we show qualitative and quantitative results of using these two schemes with HEVC.
5. Results and discussion
In this section, we analyze the performance of the proposed visually lossless image coding algorithm. For this task, all 30 reference images in the CSIQ database were compressed at visually lossless rates using the proposed method and compared against standard HEVC. The main difference is that standard HEVC employs a uniform QP for coding the whole image, whereas our approach uses spatially adaptive QP values based on masking.
Furthermore, we have found that it is possible to induce distortions at up to 10 dB above the predicted thresholds while still yielding images which are visually lossless under normal viewing conditions. The contrast thresholds measured in the aforementioned experiment, and thus the contrast thresholds predicted by the CGC+SF model, are accurate for highly controlled viewing conditions; yet they are quite conservative for normal, everyday viewing.
5.1. QP maps
The CGC+SF model takes a 64 × 64-pixel image patch as input and predicts the distortion contrast threshold () and the corresponding threshold map. Figure 10 shows the QP maps generated from the CGC+SF model for eight images in the CSIQ database.
Observe that the QP maps are indeed image-adaptive; that is, the pattern of how quantization step sizes vary across space adapts to the image content (which is itself based on the masking model and the relationship between QP and the distortion contrast). In general, the QP maps specify larger quantization step sizes for regions that can mask the resulting distortions and smaller quantization step sizes for regions with less masking. For example, in the cactus image, the bodies of the cacti impose great masking, the bird and the boundaries of the cacti impose much less masking, and the sky imposes almost no masking. Accordingly, the QP values are smallest for the sky, larger for the bird and cacti boundaries, and largest for the bodies of the cacti.
Again, we remind the reader that the QP maps alone can provide only a rough gauge of how the distortions will be distributed across space. Recall from Figure 7 that the relationship between QP and the contrast of the resulting distortion C is very much patch-specific. The same QP applied to two different blocks can give rise to vastly different distortion contrasts.
5.2. Distortion contrast maps
The proposed coding approach assumes that, to compress an image in a visually lossless manner, the RMS contrast of the distortion in any compressed image region should be near or below the ground-truth RMS contrast threshold. Thus, to verify the effectiveness of our proposed approach, Figure 11 shows the contrast threshold maps (masking maps) for four sample images (as predicted by the CGC+SF model), as well as the resulting distortion contrast maps of the corresponding images coded with standard HEVC and the two proposed approaches. Note that the displayed contrast threshold maps are all 10 dB greater than predicted by the CGC+SF model because the experimental contrast thresholds are overly conservative for normal viewing conditions. As we have found in our research, distortions with a contrast up to 10 dB above threshold can still remain visually undetectable under normal viewing conditions. Observe from Figure 11 that the distortion contrast patterns of images coded by standard HEVC differ considerably from the ground truth, whereas those of images coded by the proposed approaches appear quite similar in pattern to the masking maps. These figures demonstrate that it is possible to achieve better compression performance than standard HEVC by using QP maps and the proposed adaptive coding scheme. We quantify the compression performance of each method in the following section.
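A block-wise comparison of distortion contrast against a threshold map, of the kind visualized in Figure 11, might be computed as in the sketch below. It assumes that RMS contrast is the standard deviation of the distortion divided by the mean luminance of the reference block, and that contrast is expressed in dB as 20 log10(C). Both conventions, and all function names, are assumptions made for illustration.

```python
import numpy as np

def distortion_contrast_map(ref, coded, block=64):
    """RMS contrast of (coded - ref) for each non-overlapping block."""
    ref = np.asarray(ref, dtype=float)
    coded = np.asarray(coded, dtype=float)
    rows, cols = ref.shape[0] // block, ref.shape[1] // block
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * block, (r + 1) * block),
                  slice(c * block, (c + 1) * block))
            err = coded[sl] - ref[sl]
            mean_lum = max(float(ref[sl].mean()), 1e-6)  # avoid division by zero
            out[r, c] = err.std() / mean_lum
    return out

def to_db(contrast, floor=1e-6):
    """Contrast in dB; values below `floor` are clamped to keep the log finite."""
    return 20.0 * np.log10(np.maximum(contrast, floor))

def fraction_within_threshold(dist_map, thresh_map, margin_db=10.0):
    """Share of blocks whose distortion stays within threshold + margin (in dB)."""
    return float(np.mean(to_db(dist_map) <= to_db(thresh_map) + margin_db))
```

The default `margin_db=10.0` mirrors the 10 dB allowance discussed above. Setting it to zero would test against the model's predicted thresholds directly.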
5.3. Compression performance
|Image||Standard HEVC||JPEG||JPEG2000||Approach 1||Approach 2|
Table 1 shows the compression results for the 30 images using standard HEVC, JPEG, JPEG2000, and the two proposed approaches. To compare with the standard HEVC, JPEG, and JPEG2000 coding methods, a visual quality matching experiment was performed by three experienced subjects. The purpose of the experiment was to find the compression rate at which the three reference coding methods (i.e., HEVC, JPEG, and JPEG2000) yielded images with just-detectable distortions; the corresponding bit-rates of these “at-threshold” compressed images were then recorded. Note that all five coding methods add only near- or below-threshold distortions, and thus judging the quality of the images is quite difficult. Although human subjective judgment is a more reliable way of assessing the intensities of near/below-threshold distortions, we also report the PSNRs and the absolute RMS contrast errors between the reference images and the coded images for reference.
From Table 1, observe that the second approach based on the CGC+SF model reduces the coded rate (bpp) by about 16% on average as compared with standard HEVC, while still maintaining relatively higher PSNR values and equivalent RMS contrast errors. In comparison, the first approach seems to work less effectively. This might be due to the fact that fixed local QP values are applied to the local image areas, but some local QP values are improperly estimated because of highly complex image patches and potential model limitations. However, this straightforward approach still performs competitively, considering the relatively smaller errors it produces. For the second approach, we employed additional parameters, which indirectly adjust the coded rate to meet the visually lossless requirement. Note that for each method, the average total error is around 700 dB, which means that for each block there is an approximately 10 dB RMS contrast error (each image contains 64 blocks) compared with the ground truth. This is also attributed to the three-alternative forced-choice procedure that was used in the experiment and mentioned in Section 5.2. Also, it should be noted that we generated the QP maps mainly from contrast masking and structural facilitation. Thus, if an image does not contain areas that can sufficiently mask the distortions, using the QP map yields no gain.
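As a quick check of the per-block error figure quoted above, assuming square images of 512 × 512 pixels (which is what 64 blocks of 64 × 64 pixels implies) and taking the total error as roughly 700 dB:

```latex
\[
N_{\text{blocks}} = \left(\frac{512}{64}\right)^{2} = 64,
\qquad
\frac{700\ \text{dB}}{64\ \text{blocks}} \approx 10.9\ \text{dB per block}
\]
```

This agrees with the roughly 10 dB per-block figure stated in the text.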
This chapter described a computational model which predicts masking maps for any given input image, and two approaches which employ the predicted masking map to achieve visually lossless compression. The proposed computational model consists of a contrast gain control model, which was trained on a database of local masking thresholds in natural images, and a structural facilitation model, which was incorporated to take into account the effects of recognizable structures on distortion visibility. Compared with standard HEVC, our approach shows an average 16% improvement in bit-rate when tested on the CSIQ database.