Fast Motion Estimation's Configuration Using Diamond Pattern and ECU, CFM, and ESD Modes for Reducing HEVC Computational Complexity

The high performance of the high efficiency video coding (HEVC) standard makes it well suited to high-definition resolutions. Nevertheless, this encoding performance comes with a tremendous encoding complexity compared to the earlier H.264 video codec. The HEVC complexity is mainly due to the motion estimation (ME) module, which accounts for a large part of the encoding time, and much research therefore focuses on optimizing this module. Some works address hardware solutions exploiting the parallel processing of FPGA, GPU, or other multicore architectures, while others focus on software optimizations by introducing fast mode decision algorithms. In this context, this article proposes a fast HEVC encoder configuration to speed up the encoding process. The fast configuration uses different options such as the early skip detection (ESD), the early CU termination (ECU), and the coded block flag (CBF) fast method (CFM) modes. Regarding the ME algorithm, the diamond search (DS) is used in the encoding process across several video resolutions. A time saving of around 46.75% is obtained with an acceptable distortion in terms of video quality and bitrate compared to the reference test model HM16.2. Our contribution is compared to other works for better evaluation.


Introduction
The fast development of multimedia technology and network communications has made high-definition (HD) and ultrahigh-definition (UHD) video content widely used in our daily life. This rapid shift to high video resolutions, which raises problems in terms of memory storage cost and transmission bandwidth, gave birth to the new high efficiency video coding (HEVC) standard [1,2]. HEVC was standardized in 2013 by the Joint Collaborative Team on Video Coding (JCT-VC) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). It was developed to cope with the enormous amount of UHD video content. Compared to the earlier H.264/AVC [3] standard and at identical visual quality, HEVC guarantees a high encoding performance, reducing the bitrate by up to 50% [4]. This encoding performance comes at the price of a huge computational complexity. Motion estimation (ME) represents a large part of the encoding process and occupies around 70% of the total inter prediction time, as Jungho [5] indicates in Figure 1.
This large consumption is principally due to the new hierarchical block coding structure based on coding tree units (CTUs). This concept is analogous to the macroblocks of the earlier compression standard. Each picture frame is divided into square blocks, called coding units (CUs) [6], where 64 × 64 is the maximum size, which are recursively subdivided down to 8 × 8 blocks. Each CU contains prediction units (PUs) and transform units (TUs), where the PU is the basic unit in the ME process. Figure 2 shows the CTU tree structure in the HEVC standard, where LCU denotes the largest coding unit and SCU the smallest coding unit.
Reducing the time required by the search algorithm automatically reduces the ME computational complexity. Furthermore, using fast mode decision algorithms based on early termination reduces the ME computational complexity further, which leads to a reduction of the overall HEVC execution time.
It is within this context that this article presents a fast encoding algorithm principally based on the early skip detection (ESD), the coded block flag (CBF) fast method (CFM), and the early CU termination (ECU) modes [7][8][9] to decrease the HEVC encoding complexity.
The remainder of this paper is structured as follows: the next section reviews some works on HEVC fast motion estimation algorithms. Section 3 provides an overview of the motion estimation algorithm. Section 4 highlights the proposed fast configuration for the HEVC encoder. Experimental results for the fast HEVC configuration, compared to the results obtained with the original HM16.2 reference software [10], are discussed in Section 5. Finally, conclusions and some prospects are given in Section 6.

Related works
Aiming to reduce the HEVC encoder complexity, several works have been proposed to reduce the cost of the test zonal search (TZS) motion estimation algorithm. Some works address hardware solutions, and others focus on software optimizations.
In [11], using sequential and parallel techniques, two hardware diamond architectures for HEVC video coding are proposed. These architectures achieve an encoding in full HD at 30 frames per second using a Virtex-7 field programmable gate array (FPGA) design.
Authors in [12] have proposed a hardware parallel sum of absolute differences (SAD) design for gray-scale images to reduce motion estimation time for a block size of 4 × 4 pixels. A multiplier is exploited for addition as a partial product reduction (PPR). Results obtained on a Xilinx Virtex-2 FPGA show that the maximum frequency obtained is 133.2 MHz for a 4 × 4 block size. Nalluri et al., in [13,14], have proposed two other SAD architectures on a Xilinx Virtex FPGA, without and with parallelism. The proposed parallel architecture accelerates the SAD calculation by 3.9× compared to the serial SAD architecture. In [15], authors have proposed two implementations of the SAD and SSD algorithms using an NVIDIA GeForce GTX480 with the CUDA language in order to reduce the ME run-time. The proposed architecture saved about 32% of encoding time for class E video sequences with nonsignificant degradation in the PSNR and the bitrate.
Regarding software solutions, the 8-point square and the 8-point diamond patterns have been replaced by Nalluri et al. [16] with a 6-point hexagonal pattern in the TZS ME algorithm, and 50% of the encoding time is saved without degradation in bitrate and PSNR. To replace the TZS algorithm, authors in [17,18] proposed small diamond pattern search (SDPS), large diamond pattern search (LDPS), and horizontal diamond search (HDS). Experiments using HM8.0 showed that these algorithms allow a reduction of 49% of motion estimation calculation time with a nonsignificant increase in bitrate and a slight degradation in video quality.
In [19], Liquan et al. have proposed a fast mode decision algorithm that skips some depths. The proposed work saves about 21.5% of encoding time with a slight bitrate increase and a negligible coding efficiency loss. The algorithm proposed by Qin [20] uses the ECU algorithm according to an adaptive MSE threshold value. This work ensures time saving without degradation in quality. Podder [21] has also proposed an interesting software method to reduce the ME time. Based on human visual features (HVF), an efficient decision of the appropriate block partitioning mode is obtained. This work saves 41.44% of the execution time for SCVS video sequences. In the work published in [22], a fast HEVC ME based on DS and three fast mode decisions, the ECU, ESD, and CFM modes, has been presented. Simulation results show a reduction of 56.75% in HEVC complexity in terms of execution time, accompanied by a slight degradation in video quality and bitrate, when compared to HM16.2 executed on an Intel® Core™ i7-3770 @ 3.4 GHz processor. Authors in [22] have tested just one sequence from each class with just two quantization parameters (QPs), QP = 22 and 37, to evaluate the use of the fast modes.
By analyzing all these previous works, we can note that using fast mode decision algorithms represents an interesting technique in order to reduce the HEVC computational complexity.

Overview of the motion estimation in the new HEVC
The TZ Search algorithm, used in the HEVC ME process (Figure 3), includes four main stages to determine the best motion vector.
These stages, namely the motion vector prediction (MVP), the first search performed with a square or diamond pattern, the raster search, and the refinement, are described in the next subsections.

Motion vector prediction (MVP)
To compute the corresponding block's median predictor, the TZS algorithm uses the up predictor, the upper right predictor, and the left predictor (Figure 4). The median computation is done via the following equation.
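As an illustration, a minimal sketch of a component-wise median over the three predictors is given here, assuming the usual median-of-three formulation (sum minus minimum minus maximum). The function names `median3` and `median_mv_predictor` are illustrative and are not taken from the HM software.

```python
def median3(a, b, c):
    """Median of three scalars: the sum minus the minimum and the maximum."""
    return a + b + c - min(a, b, c) - max(a, b, c)


def median_mv_predictor(left, up, up_right):
    """Component-wise median of three candidate motion vectors (x, y).

    Assumption: each predictor is a pair of integer displacements.
    """
    return (
        median3(left[0], up[0], up_right[0]),
        median3(left[1], up[1], up_right[1]),
    )
```

For example, with the left, up, and upper right predictors (1, 5), (3, -2), and (2, 0), the median predictor is (2, 0).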

Initial grid search
The first search is performed by determining the search pattern and the "searchrange." As detailed in Figure 5(a) and (b), the main goal of this stage is to localize the search window using a square or a diamond pattern.
Both search patterns evaluate eight points in each round. The distance corresponding to the minimum distortion point is saved in the "BestDistance" variable. Currently, the diamond search pattern is used by default, but the square search pattern can also be used by modifying the HEVC configuration file through the "Diamondsearch" variable.
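The initial grid search with a diamond pattern can be sketched as follows. This is a simplified illustration under stated assumptions, not the HM implementation: the distance is assumed to double at each round from 1 up to the search range, the pattern stays centered on the start point, clipping to the picture boundaries is omitted, and `cost` is a hypothetical callable returning the distortion (for instance a SAD) of a candidate position.

```python
def diamond_points(center, dist):
    """Return the diamond-pattern candidates at a given distance.

    Eight points are produced for dist >= 2; at dist == 1 the diagonal
    offsets collapse onto the center, so only four points remain.
    """
    cx, cy = center
    h = dist // 2
    offsets = {(0, -dist), (0, dist), (-dist, 0), (dist, 0),
               (-h, -h), (h, -h), (-h, h), (h, h)}
    offsets.discard((0, 0))
    return [(cx + dx, cy + dy) for dx, dy in sorted(offsets)]


def initial_grid_search(cost, start, search_range):
    """Return (best_point, best_distance) after the first search stage."""
    best_point, best_cost, best_distance = start, cost(start), 0
    dist = 1
    while dist <= search_range:
        for p in diamond_points(start, dist):
            c = cost(p)
            if c < best_cost:
                best_point, best_cost, best_distance = p, c, dist
        dist *= 2  # assumed: distances 1, 2, 4, ... up to the search range
    return best_point, best_distance
```

The returned distance plays the role of "BestDistance" in the next stage: a value of 0 means that no candidate improved on the start point.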

Raster search
This step starts from the distance that corresponds to the best matched point of the previous search. The cases according to this distance, denoted as "BestDistance", are summarized as follows:
• When "BestDistance" = 0, the process is stopped.
• When BestDistance > iRaster, a raster scan is performed using the iRaster value as the step length. This step is carried out when the difference between the starting position and the motion vector obtained from the first stage is too large. It is computed over the entire search window. Figure 6 shows an example of a full search with iRaster equal to 4.
In the configuration file, "iRaster" is an adjustable threshold that the distance should not exceed.
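The raster scan itself can be sketched as a subsampled full search. The following is only an illustration: `cost` is a hypothetical distortion function, the window is assumed to be a square of half-width `search_range` around the center, and boundary clipping is omitted.

```python
def raster_search(cost, center, search_range, i_raster):
    """Scan the whole search window with a stride of i_raster samples."""
    cx, cy = center
    best_point, best_cost = center, cost(center)
    for y in range(cy - search_range, cy + search_range + 1, i_raster):
        for x in range(cx - search_range, cx + search_range + 1, i_raster):
            c = cost((x, y))
            if c < best_cost:
                best_point, best_cost = (x, y), c
    return best_point
```

With a search range of 8 and iRaster equal to 4, only a 5 × 5 grid of positions is visited instead of the 17 × 17 positions of an exhaustive search.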

Raster and star refinement
The refinement is performed when the distance of the motion vector previously obtained is different from 0. There are mainly two refinement types:
• Raster refinement: The best point obtained from the previous steps is the start point of the raster refinement. It can be performed using a diamond or a square pattern with distances ranging from "search range" down to one. In each iteration, the distance is divided by 2, and when the distance equals one, two adjacent point searches are performed, and then the process is stopped.
• Star refinement: In this step, the point selected in the previous steps is the start point of the star refinement. In each iteration, the distance is divided by two, and when the distance equals one, two adjacent point searches are applied to determine the best estimated MV, which gives the minimum SAD (Figure 7).
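The halving of the distance during refinement can be illustrated with the short sketch below. It is a simplification under stated assumptions: a four-point cross is used instead of the full diamond or square pattern, the pattern is re-centered on the best point at each iteration, the final two adjacent point search is omitted, and `cost` is a hypothetical distortion function.

```python
def refine(cost, start, start_distance):
    """Refine a motion vector by halving the search distance down to one."""
    best_point, best_cost = start, cost(start)
    dist = start_distance
    while dist >= 1:
        cx, cy = best_point  # re-center on the best point found so far
        for dx, dy in ((0, -dist), (0, dist), (-dist, 0), (dist, 0)):
            p = (cx + dx, cy + dy)
            c = cost(p)
            if c < best_cost:
                best_point, best_cost = p, c
        dist //= 2  # the distance is divided by two at each iteration
    return best_point
```

Because each iteration only visits a few candidates, the result is a local minimum close to the start point and not necessarily the global minimum of the search window.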

The proposed fast configuration
Several fast mode decision algorithms are combined in this work to speed up the ME process. First, the diamond search pattern is used to decrease the encoder computational complexity. Some options are also enabled, namely the early CU termination (ECU), the early skip detection (ESD), and the coded block flag (CBF) fast method (CFM), which are fast mode decision algorithms adopted in HEVC video coding. These fast algorithms are described below.

Early CU termination (ECU)
This algorithm is applied when moving from a depth p to the next depth p + 1. As Figure 8 shows, if skip is the best prediction mode of the current CU, the sub-tree calculations can be skipped [23]. The best mode is determined by the rate distortion (RD) cost [24]. When the minimal RD cost corresponds to the skip mode, the partitioning is stopped [25].
Several works show that skip is the most frequently chosen mode [25]. This explains why a considerable gain is obtained when the skip mode is detected early. This mode improves the encoder performance since it denotes a block coded without residual information.
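The effect of ECU on the quadtree traversal can be sketched as follows. This is an illustrative model, not HM code: a CU is represented as a hypothetical tuple `(x, y, size)`, and `evaluate` is an assumed callable returning the best mode and its RD cost for a CU at the current depth.

```python
def split(cu):
    """Split a square CU (x, y, size) into its four sub-CUs."""
    x, y, size = cu
    half = size // 2
    return [(x, y, half), (x + half, y, half),
            (x, y + half, half), (x + half, y + half, half)]


def rd_search(cu, depth, max_depth, evaluate, use_ecu=True):
    """Return the best RD cost of a CU, optionally with early CU termination."""
    mode, cost = evaluate(cu)
    if depth == max_depth:
        return cost
    if use_ecu and mode == "SKIP":
        return cost  # ECU: the four sub-CUs at depth + 1 are not evaluated
    split_cost = sum(
        rd_search(sub, depth + 1, max_depth, evaluate, use_ecu)
        for sub in split(cu)
    )
    return min(cost, split_cost)
```

In this model, when skip is the best mode at the root of a 64 × 64 CU with three further depths, a single evaluation replaces the 85 evaluations (1 + 4 + 16 + 64) of the full quadtree.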

Early skip detection (ESD)
The early skip detection is a simple check of two skip conditions, the CBF and the differential motion vector (DMV). As shown in Figure 9, this check is performed after determining the best inter 2N × 2N mode. Before checking the skip mode, the current CU evaluates two inter 2N × 2N modes (advanced motion vector prediction, called AMVP, and merge mode). The DMV and CBF of the mode with the minimum RD cost are then checked. When the CBF is equal to zero and the DMV of the best inter 2N × 2N mode is equal to (0, 0), the skip mode is the best mode of the current CU. Consequently, the remaining PU modes are not examined [8].
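The ESD condition reduces to a two-term test, sketched below. The dictionary keys `cbf` and `dmv` and the function names are illustrative assumptions and do not correspond to HM identifiers.

```python
def early_skip_detected(cbf, dmv):
    """ESD condition: no coded residual and a zero differential motion vector."""
    return cbf == 0 and dmv == (0, 0)


def remaining_modes_to_test(best_2nx2n, other_modes):
    """Return the PU modes still to be examined after the inter 2Nx2N check.

    best_2nx2n is a hypothetical record of the best inter 2Nx2N result,
    for example {"cbf": 0, "dmv": (0, 0)}.
    """
    if early_skip_detected(best_2nx2n["cbf"], best_2nx2n["dmv"]):
        return []  # skip is selected; the remaining PU modes are not examined
    return list(other_modes)
```

For instance, `remaining_modes_to_test({"cbf": 0, "dmv": (0, 0)}, ["2NxN", "Nx2N"])` returns an empty list, whereas a nonzero CBF or DMV leaves both modes to be tested.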

CBF fast method (CFM)
The CBF fast method (CFM) speeds up the selection of the best mode of a prediction unit [7]. As shown in Figure 10, the RD cost is calculated for each PU mode belonging to a CU.
The coded block flags of the luminance and of the two chrominance components are then evaluated. When all coded block flags (CBF_Y, CBF_U, and CBF_V) are equal to zero [9], the remaining modes are not tested.
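The CFM early exit can be sketched as a loop over the PU modes. This is an illustrative model only: `evaluate` is a hypothetical callable returning the RD cost and the three coded block flags of a mode, and the mode names are plain strings.

```python
def cfm_mode_search(pu_modes, evaluate):
    """Test PU modes in order and stop once a mode has all-zero CBFs.

    Returns the best mode among the tested ones and the list of tested modes.
    """
    best_mode, best_cost, tested = None, float("inf"), []
    for mode in pu_modes:
        cost, cbf_y, cbf_u, cbf_v = evaluate(mode)
        tested.append(mode)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
        if cbf_y == 0 and cbf_u == 0 and cbf_v == 0:
            break  # CFM: the remaining modes are not tested
    return best_mode, tested
```

The time saving comes from the modes that are never evaluated: the earlier a mode with no coded residual is found, the shorter the loop.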

Experimental conditions
The performance evaluation of this work is carried out with a random access (RA) configuration using the HM16.2 reference test model, exploiting the fast mode decision algorithms ECU, ESD, and CFM detailed previously. To assess the fast implementation, the HEVC encoding time, bitrate, and PSNR are compared with those of the original encoder, with a search range of 64. The maximum CU size is also 64, and the maximum CU partition depth is four. An Intel® Core™ i7-3770 @ 3.4 GHz processor with a Windows 8 OS platform is used in this work.

Evaluation criteria
To evaluate this work, we used the formulas detailed in Table 1.
65.135%. This is due to the slow motion in these sequences. Indeed, for videos containing low motion activity [18], the improvement is more significant. The Traffic video, which has the highest resolution, is characterized by intensive movement of objects against a stationary background. Concerning BQSquare, this fast-motion video is often coded with the bi-predictive mode, as it is the best prediction mode.

Results
Conversely, for sequences with high activity, such as BlowingBubbles, RaceHorses, and PeopleOnStreet, the time saving is only around 34.73% and 28.38%. The worst case is the motion-filled and dynamic RaceHorses video, which records a horse race. This video contains many high-frequency details, such as the horses' tails, which are typically expensive to encode.
The time saving reaches 49% for the BasketballDrive sequence. This video contains high contrast and high motion activity. The background has a rather uniform texture.
Not only the encoding time is reduced but also the bitrate, as shown by the negative values in the table, ranging from 0.002 to −2.182% for PartyScene and PeopleOnStreet with QP equal to 22 and 37, respectively. Regarding video quality, the PSNR degradation ranges from −0.015 to −0.23 dB for BasketballDrive and RaceHorses with QP equal to 22 and 37, respectively.
On average, the fast HEVC configuration induces a nonsignificant degradation in video quality, around 0.106 dB, with a decrease of 0.416% in the bitrate, which is a very interesting point in terms of compression performance. Figure 11 shows the rate distortion (RD) curves of the original HEVC algorithm and the fast one for two sequences of each class: PeopleOnStreet and Traffic from class A (2560 × 1600), BQTerrace and BasketballDrive from class B (1920 × 1080), PartyScene and BasketballDrill from class C (832 × 480), and BlowingBubbles and BQSquare from class D (416 × 240). This can also be checked in Table 2. The sequences are encoded at QPs 22, 27, 32, and 37.
Four QP values are presented in all curves; the horizontal axis represents the bitrate (kbps) and the vertical axis represents the PSNR (dB). Figure 11 shows that all RD curves overlap [27]. In fact, the proposed changes have an insignificant impact on bitrate and PSNR. For lower QP values, the degradation is more significant. Experimental results show that the fast configuration performs better than the original one, given that it offers a significant time saving without a significant impact on the quality and the bitrate.
Further, for all tested sequences, a larger speedup is obtained for higher QPs. Figure 12 evaluates the average time saving when varying the QP from 22 to 37. We note that the time saving increases with the QP. On average, for the highest QP, equal to 37, the run-time decreases by 63.5%. This reduction is explained by the more frequent choice of the skip mode for higher QP values [25].
Table 3 summarizes the performance of the proposed work compared to different previous algorithms. Compared to [17], the proposed work is more efficient in terms of bitrate and time saving. In fact, [17] saves about 25.95% of encoding time with a slight bitrate increase. This algorithm is based on the large diamond search pattern as the motion estimation algorithm, implemented on HM8.0. Concerning Liquan [19], the algorithm consists of skipping some depths according to those used in the preceding frames. This work saves about 21.5% of encoding time with a slight bitrate increase. Qin [20] implemented an algorithm based on the ECU according to an adaptive MSE threshold value. A time saving without degradation in quality is obtained in this work. Another interesting method was presented by Podder et al. [21], where human visual features (HVF) are used for the selection of appropriate block partitioning modes. This work offered a 41.44% reduction in time for the standard class video sequences (SCVS).

Conclusion
HEVC brings important progress in terms of video quality, in particular for high video resolutions. Nevertheless, this performance is combined with a higher computational complexity, which tremendously increases the encoding time. The motion estimation module using the quadtree structure is the most demanding process and the main cause of the increase in HEVC computational complexity. In this paper, to decrease this computational complexity, a fast configuration was presented to optimize the ME process by using CU partitioning fast mode decision algorithms and a diamond search. A reduction of 46.75% in the encoding time is obtained without inducing a significant degradation in encoding performance in terms of video quality or bitrate.
As future work, additional optimizations will also be implemented to reduce the encoder complexity via a digital platform for video processing.
We will also exploit the fast configuration detailed in this paper for the new compression standard of the Joint Video Exploration Team (JVET) [28,29].

Table 3. Proposed algorithm compared to previous works.