A Real-Time Video Encoding Scheme Based on the Contourlet Transform

Real-time video communication over the internet and other heterogeneous IP networks has become a significant part of modern communications, underlining the need for highly effi‐ cient video coding algorithms. The most desirable characteristic of such an algorithm would be the ability to maintain satisfactory visual quality while achieving good compression. Ad‐ ditional advantageous characteristics would be low computational complexity and real-time performance, allowing the algorithm to be used in a wide variety of less powerful comput‐ ers. Transmission of video over the network would benefit by the ability to adapt to the net‐ work's end-to-end bandwidth and transmitter/receiver resources, as well as by resistance to packet losses that might occur. Additionally, scalability and resistance to noise would be highly advantageous characteristics for a modern video compression algorithm. Most state of the art video compression techniques like the H.264, DivX/Xvid, MPEG2 fail to achieve real time performance without the use of dedicated hardware due to their high computa‐ tional complexity. Moreover, in order to achieve optimal compression and quality they de‐ pend on multipass statistical and structural analysis of the whole video content, which cannot happen in cases of live video stream generation as in the case of video-conferencing.


Introduction
Real-time video communication over the internet and other heterogeneous IP networks has become a significant part of modern communications, underlining the need for highly efficient video coding algorithms. The most desirable characteristic of such an algorithm would be the ability to maintain satisfactory visual quality while achieving good compression. Additional advantageous characteristics would be low computational complexity and real-time performance, allowing the algorithm to be used in a wide variety of less powerful computers. Transmission of video over the network would benefit by the ability to adapt to the network's end-to-end bandwidth and transmitter/receiver resources, as well as by resistance to packet losses that might occur. Additionally, scalability and resistance to noise would be highly advantageous characteristics for a modern video compression algorithm. Most state of the art video compression techniques like the H.264, DivX/Xvid, MPEG2 fail to achieve real time performance without the use of dedicated hardware due to their high computational complexity. Moreover, in order to achieve optimal compression and quality they depend on multipass statistical and structural analysis of the whole video content, which cannot happen in cases of live video stream generation as in the case of video-conferencing.
In this chapter, a more elaborate analysis of a novel algorithm for high-quality real-time video encoding, originally proposed in [1], is presented. The algorithm is designed for content obtained from low resolution sources like web cameras, surveillance cameras, etc. Critical to the efficiency of video encoding algorithm design is the selection of a suitable image representation method. Texture representation methods proposed in the literature that utilize the Fourier transform, the Discrete Cosine transform, the Wavelet transform as well as other frequency domain methods have been extensively used for image and video encoding. Never-

The Contourlet Transform
The Contourlet Transform (CT) is a directional multiscale image representation scheme proposed by Do and Vetterli, which is effective in representing smooth contours in different directions of an image, thus providing directionality and anisotropy [2]. The method utilizes a double filter bank in which, first the Laplacian Pyramid (LP) [6] detects the point discontinuities of the image and then the Directional Filter Bank (DFB) [7] links those point discontinuities into linear structures. The LP provides a way to obtain multiscale decomposition. In each LP level, a downsampled lowpass version of the original image and a more detailed image with the supplementary high frequencies containing the point discontinuities are obtained. This scheme can be iterated continuously in the lowpass image and is restricted only by the size of the original image due to the downsampling. The DFB is a 2D directional filter bank that can achieve perfect reconstruction, which is an important characteristic for image and video encoding applications. The simplified DFB used for the contourlet transform consists of two stages and leads to 2 l subbands with wedge-shaped frequency partitioning [8], with l being the level of decomposition. The first stage of the DFB is a two-channel quincunx filter bank [9] with fan filters that divides the 2D spectrum into vertical and horizontal directions, while the second stage is a shearing operator that just reorders the samples. By adding a 45 degrees shearing operator and its inverse before and after a two-channel filter bank, a different directional frequency partition is obtained (diagonal directions), while maintaining the ability to perfectly reconstruct the original image, since the sampling locations coincide with the (integer) pixel grid.
The combination of the LP and the DFB is a double filter bank named Pyramidal Directional Filter Bank (PDFB). In order to capture the directional information, bandpass images from the LP decomposition are fed into a DFB. This scheme can be repeated on the coarser image levels. The combined result is the contourlet filter bank, which is a double iterated filter bank that decomposes images into directional subbands at multiple scales. The contourlet coefficients have a similarity with wavelet coefficients since most of them are almost zero and only few of them, located near the edge of the objects, have large magnitudes [10]. In the presented algorithm, the Cohen and Daubechies 9-7 filters [11] have been utilized for the Laplacian Pyramid. For the Directional Filter Bank, these filters were mapped into their corresponding 2D filters using the McClellan transform as proposed by Do and Vetterli in [2]. It must be noted that these filters are not considered as optimal. The creation of optimal filters for the contourlet filter bank remains an open research topic. An outline of the Contourlet Transform is presented on Figure 1, while an example of decomposition is shown on Figure 2.

GPU-based contourlet transform
By analysing the structure of the contourlet transform, it is evident that its most computationally intensive part is the calculation of all the 2D convolutions needed for complete decomposition or reconstruction. Calculating the convolutions on the CPU using the 2D convolution definition is not feasible for real-time applications since performance suffers significantly due to the computational complexity. Utilizing the DFT or the FFT in order to achieve better performance provides significantly faster implementations but still fails to achieve satisfactory real-time performance, especially in mobile platforms such as laptops and tablet PCs. The benefits of the FFT for the calculation of 2D convolution can only be fully exploited by an architecture supporting parallel computations. Modern personal computers are commonly equipped with powerful graphics processors (GPUs), which in the case of live video capture from web or surveillance cameras are underutilized. Intensive, repetitive computations that can be computed in parallel can be accelerated by harnessing this "dormant" computational power. General purpose computing on graphics processing units (GPGPU) is the set of techniques that use a GPU, which is otherwise specialized in handling computations for the display of computer graphics, in order to perform computations traditionally handled by a CPU. The highly parallel structure of GPUs makes them more effective than general-purpose CPUs for algorithms where processing of large blocks of data can be done in parallel.
For the GPU implementation of the contourlet transform, the NVIDIA Compute Unified Device Architecture (CUDA) has been selected due to the extensive capabilities and specialized API it offers. CUDA is a general purpose parallel computing architecture that allows the parallel compute engine in NVIDIA GPUs to be used in order to solve complex computational problems that are outside the scope of graphics algorithms. In order to compute the contourlet transform, first the image and the filters are transferred from the main memory to the GPU dedicated memory. Then, the contourlet transform of the image is calculated by migrating all the calculations on the GPU in order to reduce the unnecessary transfers to and from the main memory that introduce delay to the computations. The 2D convolutions required are calculated by means of the FFT. After calculating the contourlet transform of the image, the output is transferred back to the main memory and the GPU memory is freed. Considering that this implementation will be used for video encoding, the filters are loaded once at the GPU memory since they will not change from frame to frame. In order to evaluate the performance of this approach, various implementations of the contourlet transform were developed, both for the CPU and the GPU. These implementations were based on the FFT (frequency domain) and the 2D convolution definition (spatial domain). Except for the basic GPU implementation using spatial domain convolution, other out-of-core implementations were developed, based on the 2D convolution definition and utilizing memory management schemes in order to support larger frames when the GPU memory is not sufficient [12]. The GPU implementation based on the FFT outperformed all the aforementioned implementations in our tests and was therefore the method of choice for our video encoder.

The YCoCg colour space
Inspired by recent work on real-time RGB frame buffer compression using chrominance subsampling based on the YCoCg colour transform [13], we investigated the use of these techniques in conjunction with the contourlet transform to efficiently encode colour video frames.
The human visual system is significantly more sensitive to variations of luminance compared to variations of chrominance. Encoding the luminance channel of an image with higher accuracy than the chrominance channels provides a simple low complexity compression scheme, while maintaining satisfactory visual quality. Various image and video compression algorithms take advantage of this fact in order to achieve increased efficiency. First introduced in H.264 compression, the RGB to YCoCg transform decomposes a colour image into luminance (Y), orange chrominance (Co) and green chrominance (Cg) components and has been shown to exhibit better decorrelation properties than YCbCr and similar transforms [14]. It was developed primarily to address some limitations of the different YCbCr colour spaces [15]. The transform and its reverse are calculated by the following equations: In order for the reverse transform to be perfect and to avoid rounding errors, the Co and Cg components should be stored with higher precision than the RGB components. Experiments using 23 images from the Kodak image set and 18 images from the Canon image set, all obtained from [16], as well as 963 outdoor scene images obtained from [17], showed that using the same precision for the YCoCg and RGB data when transforming from RGB to YCoCg and back results in an average PSNR of more than 58.87 dB for all the image sets, as shown in Table 1. This loss of quality cannot be perceived by the human visual system, resulting to no visible alteration of the image. Nevertheless, it indicates the highest quality possible when used for image compression.

The presented algorithm
Listings 1 and 2 depict the presented algorithm for encoding and decoding respectively. Input frames are considered to be in the RGB format. The first step of the algorithm is the conversion from RGB to YCoCg colour space for further manipulation of the luminance and chrominance channels. The luminance channel is decomposed using the contourlet transform, while chrominance channels are subsampled by a user-defined factor N. The levels and filters for contourlet transform decomposition are also defined by the user. From the contourlet coefficients obtained by decomposing the luminance channel, only a user-specified percentage of the most significant ones are retained. Then, the precision allocated for storing the contourlet coefficients is reduced. All computations up to this stage are performed on the GPU, avoiding unnecessary memory transfers from the main memory to the GPU memory and vice versa. After reducing the precision of the retained contourlet coefficients of the luminance channel, the directional subbands are encoded using a run length encoding scheme that encodes only zero valued elements. The large sequences of zero-valued contourlet coefficients that occur after the insignificant coefficient truncation make run length encoding ideal for their encoding. The algorithm divides the video frames into two categories; keyframes and internal frames.
Keyframes are frames that are encoded using the steps described in the previous paragraph and internal frames are the frames between two key frames. The interval between two keyframes is a user defined parameter. At the step before the run-length encoding, when a frame is identified as an internal frame all its components are calculated as the difference between the respective components of the frame and those of the previous key frame. This step is processed on the GPU while all the remaining steps of the algorithm are performed on the CPU unless otherwise stated. Then, run length encoding is applied to the chromatic channels, the low frequency contourlet component of the luminance channel, as well as the directional subbands of the luminance channel. It must be noted that steps that are executed on the CPU are inherently serial and cannot be efficiently mapped to a GPU.
The last stage of the algorithm consists of the selection of the optimal precision for each video component. The user can select between lossless or lossy change of precision, directly affecting the output's visual quality.

Chrominance channel subsampling
Exploiting the fact that the human visual system is relatively insensitive to chrominance variations, in order to achieve compression, the chrominance channels Co and Cg are subsampled by a user-defined factor N that directly affects the output's visual quality and the compression achieved. The chrominance channels are stored in lower resolution, thus providing compression. For the reconstruction of the chrominance channels at the decoding stage, the missing chrominance values are replaced with the nearest available subsampled chrominance values. This approach is simple and naïve but has been selected due to the significantly smaller number of (costly) memory fetches and minimal computation cost, compared to other methods like bilinear interpolation. Utilizing the nearest neighbour reconstruction approach can introduce artifacts in the form of mosaic patterns in regions with strong chrominance transitions depending on the subsampling factor. In order to address this problem, given adequate computational resources, the receiver can choose to use the bilinear interpolation approach. Figure 3 shows an example of subsampling the Co and Cg chrominance channels by various factors, while using the nearest neighbour and the bilinear interpolation approach for reconstruction. Only a small, magnified part of the "baboon" image used is shown for clarity. As demonstrated in Figure 3, subsampling by a factor of 2 or 4 does not have a drastic effect on visual quality. Further subsampling leads to visible artifacts indicating the need for an optimal trade-off between quality and compression.

Contourlet Transform decomposition of luminance channel and quality selection
The luminance channel of the frame is decomposed using the contourlet transform. Decomposition levels, as well as the filters used are user-defined and directly affect the quality of the output. Decomposition at multiple scales offers better compression while providing scalability, i.e. multiple resolutions inside the same video stream. This characteristic allows video coding algorithms to adapt to the network's end-to-end bandwidth and transmitter/receiver resources. The quality for each receiver can be adjusted without re-encoding the video frames at the source, by just dropping the encoded information referring to higher resolution than needed.
In order to achieve compression, after the decomposition of the luminance channel with the contourlet transform, a user-defined amount of the contourlet coefficients from the direc-tional subbands are dropped by means of keeping only the most significant coefficients. The amount of coefficients dropped drastically affects the output's visual quality as well as the compression ratio. Contourlet coefficients with large magnitudes are considered more significant than coefficients with smaller magnitudes. Exploiting this fact, a common method for selecting the most significant contourlet coefficients is to keep the M most significant coefficients, or respective percentage, while dropping all the others [2] (coefficient truncation). This procedure leads to a large number of zero-valued sequences inside the elements of the directional subbands, a fact exploited by using run length encoding in order to achieve even higher compression. Considering the values and the distribution of contourlet coefficients at the directional subbands, only the zero-valued coefficients are run length encoded along the horizontal direction. Compression gained by run length encoding of all the different values is minimum and does not justify the increased computational cost. It is worth mentioning that dropping all the contourlet coefficients is similar to lowering the luminance channel's resolution while applying a lowpass filter and then, at the decoding stage, upscaling it without reincorporating the high frequency content. Keeping only the most significant contourlet coefficients also provides a means to suppress the noise induced by low-quality sensors usually encountered in web-cameras. Random noise is largely unstructured and therefore not likely to generate significant contourlet coefficients [2]. As a result, keeping only the most significant contourlet coefficients provides enhanced visual quality, which is a highly desirable characteristic since no additional filtering of the video stream is required in order to reduce the noise level. On Figure 4, an example of smoothing due to the dropping of contourlet coefficients is shown. Mosaicing artifacts and noise introduced due to the low quality of the web camera's sensor are suppressed and replaced by a fuzzier texture, resulting in a smoother and more perceptually acceptable image.
At the contourlet transform decomposition stage, 32 bit single precision floating point elements are used in order to avoid rounding errors and precision loss. Experiments with the precision allocated for the contourlet coefficients showed that the contourlet transform exhibits resistance to quality loss due to arithmetic precision reduction. This fact is exploited in order to achieve better compression by reducing the precision of the contourlet coefficients through rounding to a specific decimal point. Visual quality is not affected at all when only one decimal or more are kept. Rounding to the integer provides a PSNR of more than 60 dB when only the directional subbands' coefficients are rounded. Additionally, also rounding the low pass content provides a PSNR of more than 55 dB. In both cases, the loss of quality cannot be perceived by the human visual system and is considered as insignificant. For these experiments, the images were first transformed into the YCoCg colour space. Then the luminance channel was decomposed using the contourlet transform and the contourlet coefficients were rounded. No alteration was done to the chrominance channels. After the manipulation of the contourlet coefficients, the luminance channel was reconstructed and the image was transformed back into the RGB colour space.

Frame types
As mentioned before, frames are divided into keyframes and internal frames, with an internal frame being the difference between the current frame and the respective keyframe. Consecutive frames tend to have small variations, with many identical regions. This fact can be exploited by calculating the difference between a frame and the keyframe. This procedure provides components with large sequences of zero values leading to improved compression through the run length encoding stage. Especially in the case of video-conferencing or surveillance video, the background tends to be static, with slight or no variations at all. The occurrence of static background leads to many parts of the consecutive frames to be identical. As a result, calculating the difference of each frame from its respective keyframe provides large sequences of zero values leading to improved compression when run length encoding is applied. Run length encoding of the difference of contourlet-transformed images is even more efficient, since static noise is drastically suppressed by the coefficient truncation. Experiments showed that the optimal compression is achieved for a relatively small interval between keyframes, in the region of 5-7 internal frames, providing small groups of pictures (GOP) that depend to a keyframe. This characteristic makes the algorithm more resistant to packet loses when transmitting over a network. In the case of a scene change, consecutive frames drastically differ from each other and the compression achieved for the internal frames until the next keyframe is similar to that of a keyframe. If this scenario occurs, having small intervals between consecutive keyframes reduces the number of non optimally encoded frames. Nevertheless, in cases where the video is expected to be mostly static, like surveillance video for example, a larger interval between keyframes will provide considerably better compression.

Other supported colour spaces
Except for the YCoCg colour space, the algorithm supports the YCbCr and Greyscale colour spaces without the need to alter its core functionality. The process of encoding greyscale videos consists of handling the video as a colour video with only the luminance channel. All the steps of the algorithm are calculated except for those referring to the chromatic channels. On the other hand, due to the similarity of the YCbCr colour space with the YCoCg colour space the algorithm remains the same. The only difference is the RGB-to-YCbCr conversion at the encoder and the YCbCr-to-RGB conversion at the decoder. The luminance channel is identically handled as in the YCoCg-based algorithm, and the same holds for the replacement of CoCg channels with the CbCr ones. Nevertheless, the CbCr channels have a different range of values compared to CoCg channels. As a consequence, the optimal precision for the CbCr channels differs from that of the CoCg channels and has to be taken into consideration.

Quality and performance analysis
For evaluating the presented algorithm, two videos were captured using a VGA web camera that supported a maximum resolution of 640x480 pixels. Low resolution web cameras are very common on everyday personal computer systems showcasing the need to design video encoding algorithms that take into consideration the problems arising due to low-quality sensors. The videos captured were a typical video-conference sequence with static background showing the upper part of the human body and containing some motion, and a surveillance video with almost no motion depicting the entrance of a building.
The captured videos were encoded using the YCoCg, YCbCr and Greyscale colour spaces. The chrominance channels of the colour videos were subsampled by a factor of 4 and the video stream contained two resolutions: the original VGA (640x480) as well as the lower QVGA (320x240). The method utilized for the reconstruction of the chrominance channels was the nearest neighbour method. The percentage of the most significant contourlet coefficients of the luminance channel retained was adjusted for each encoded video, providing results of various quality and compression levels. Furthermore, at each scale, the luminance channel's high frequency content was decomposed into four directional subbands. In order to test the algorithm using the YCbCr colour space, the RGB to YCbCr conversion formula for SDTV found in [18] was utilised. For the Greyscale colour space, the aforementioned videos were converted from RGB to greyscale using the standard NTSC conversion formula [18] that is used for calculating the effective luminance of a pixel: The sample videos were encoded using a variety of parameters. The mean PSNR value for each video was calculated based on a set of different percentages of contourlet coefficients to be retained. The compression ratios achieved when using the scheme that incorporates both key frames and internal frames and when compressing all the frames as keyframes were also calculated. The interval between the key frames was set to five frames for the video-conference sample video and to twenty frames for the surveillance video. Detailed results are shown on Tables 2, 3 and 4 while sample frames of the encoded videos utilizing the YCoCg, YCbCr and Greyscale colour space for a set of settings are shown on Figures 5-10.
Examining the compression ratios achieved, it is shown that utilizing the keyframe and internal frame scheme outperforms the naive method of encoding all the frames the same way, as expected. However, the selection of an efficient entropy encoding algorithm that will further enhance the compression ability of our algorithm is still an open issue. Another interesting observation is that the contourlet transform exhibits substantial resistance to the loss of contourlet coefficients. Even when only 5% of its original coefficients are retained, the visual quality of the image is not seriously affected. This fact underlines the efficiency of the contourlet transform in approximating natural images using a small number of descriptors and justifies its utilization in this algorithm. The slightly lower PSNR achieved for the surveillance video sample can be explained due to the higher complexity of the scene compared to the video conference sample. More complex scenes contain higher frequency content, a portion of which is then discarded by dropping the contourlet coefficients.        Considering the YCoCg and the YCbCr colour spaces, for the two video samples tested, it is shown on Tables 2-4 and Figures 11 and 12  Run-length decoding of directional subbands 7.008 Contourlet coefficients dropping 0.492 Table 5. Average execution times (in milliseconds) for the basic operations of the algorithm for a 640x480 video frame. The chrominance channels were subsampled by a factor of 4 and the video stream contained the original VGA (640x480) as well as the lower QVGA (320x240) resolution.

Conclusions
In this chapter, a low complexity algorithm for real-time video encoding based on the contourlet transform and optimized for video conferencing applications and surveillance cameras has been presented and evaluated. The algorithm provides a scalable video compression scheme ideal for video conferencing content as it achieves high quality encoding and increased compression efficiency for static regions of the image, while maintaining low complexity and adaptability to the receivers resources. A video stream can contain various resolutions avoiding the need for reencoding at the source. The receiver can select the desired quality by dropping the components referring to higher quality than needed. Furthermore, the algorithm has the inherent ability to suppress the noise induced by low-quality sensors, without the need of an extra denoising or image enhancement stage, due to the manipulation of the structural characteristics of the video through the rejection of insignificant contourlet transform coefficients. In the case of long recordings for surveillance systems, where higher compression is needed, the visual quality degradation is much more eye-friendly than with other well established video compression methods, as it introduces fuzziness and blurring instead of artificial block artifacts, providing smoother images and facilitating image rectification/recognition procedures. Additionally, due to the relatively small GOPs, the algorithm is more resistant to frame losses that can occur during transmis-sion over IP networks. Another advantageous characteristic of the presented algorithm is that its most computationally intensive parts are calculated on the GPU. The utilization of the usually "dormant" GPU computational power lets the CPU to be utilized for other tasks, further enhancing the multitasking capacity of the system and enabling the users to take full advantage of their computational capabilities. The experimental evaluation of the presented algorithm provided promising results. Nevertheless, in order to compete for compression efficiency with state of the art video compression algorithms, a highly efficient entropy encoding scheme has to be incorporated to the algorithm. Modern entropy encoding methods tend to be complex and computationally intensive. As a result, the optimal trade-off between compression rates and complexity has to be decided in order to retain the low complexity and real time characteristics of our algorithm.