Efficient Coding Tree Unit (CTU) Decision Method for Scalable High-Efficiency Video Coding (SHVC) Encoder

High-efficiency video coding (HEVC, or H.265) is the latest video compression standard developed by the Joint Collaborative Team on Video Coding (JCT-VC) and was finalized in 2013. HEVC achieves an average bit rate reduction of about 50% compared with H.264/AVC while maintaining video quality. To extend HEVC to heterogeneous access networks, the JCT-VC approved the scalable extension of HEVC (SHVC) in July 2014. SHVC achieves the highest coding efficiency, but its very high computational complexity limits its use in real-time applications. To reduce the encoding complexity of SHVC, in this chapter we exploit the temporal-spatial and inter-layer correlations between the base layer (BL) and the enhancement layer (EL) to predict the best coding tree unit (CTU) quadtree for quality SHVC. Because the layers are highly correlated, we use the coded information from the CTU quadtree in the BL, including inter-layer intra/residual prediction and inter-layer motion parameter prediction, to predict the CTU quadtree in the EL. We therefore develop an efficient CTU decision method that combines a temporal-spatial searching order algorithm (TSSOA) in the BL with a fast inter-layer searching algorithm (FILSA) in the EL to speed up the SHVC encoding process. Simulation results show that the proposed method achieves an average time improving ratio (TIR) of about 52–78% and 47–69% for the low delay (LD) and random access (RA) configurations, respectively. The proposed method thus efficiently reduces the computational complexity of the SHVC encoder with negligible loss of coding efficiency across various types of video sequences.


Introduction
With advances in electronic technology, high-resolution 4K × 2K (or 8K × 4K) panels are becoming the main specification of large digital TVs. At the same time, with the rapid development of the Internet and mobile devices, more and more people browse high-quality video content on smartphones or laptops, which greatly enriches their lives. However, the H.264/advanced video coding (AVC) standard has difficulty supporting video applications at high definition (HD) and ultrahigh definition (UHD) resolutions. Therefore, a new video coding standard called high-efficiency video coding (HEVC) was standardized in January 2013 by the Joint Collaborative Team on Video Coding (JCT-VC), jointly established by the ITU-T and ISO/IEC, to satisfy the UHD requirement, and the first edition of HEVC was approved as ITU-T H.265 and ISO/IEC 23008-2 [1]. The goal of H.265/HEVC is to achieve roughly 50% bitrate reduction over H.264/AVC while maintaining video quality [2][3][4][5][6]. HEVC adopts the quadtree-structured coding tree unit (CTU); each CTU can be recursively split into four equal coding units (CUs), and each CU can contain prediction units (PUs) and transform units (TUs). HEVC achieves the highest coding efficiency but requires very high computational complexity, which makes it difficult to use in real-time applications. Moreover, traditional client-server video streaming can no longer satisfy the ever-growing demand for video applications on heterogeneous devices and networks, including the Internet and mobile networks. To overcome this problem, scalable video coding (SVC) offers an attractive solution: a single bitstream simultaneously serves various devices with different display resolutions and image fidelities.
Therefore, to extend HEVC to heterogeneous access networks, the JCT-VC developed a scalable extension of HEVC (SHVC), which was finalized in July 2014 [7,8]. SHVC mainly provides spatial scalability, temporal scalability and quality/signal-to-noise ratio (SNR) scalability. Based on HEVC, the SHVC scheme supports multi-loop solutions by enabling different inter-layer prediction (ILP) mechanisms [9][10][11][12]. Although SHVC achieves the highest coding efficiency, it requires higher computational complexity than the HEVC standard. As a result, the very high encoding complexity of SHVC has become a main obstacle to real-time services.
To reduce the computational complexity of the SHVC encoder, many fast methods with negligible loss of image quality have been proposed recently [13][14][15][16][17]. Tohidypour et al. reduced the coding complexity of spatial or SNR/quality/fidelity scalability in SHVC using an adaptive range search method based on statistical properties [13][14][15][16]. Bailleul et al. sped up the encoding process in the enhancement layer (EL) using a fast mode decision for SNR scalability in SHVC [16]. Qingyangl et al. also proposed a fast encoding method for SNR scalability that limits the maximum encoding depth based on the correlation between the base layer (BL) and the EL, greatly reducing the encoding time in both BL and EL [17]. Although these methods reduce the encoding complexity of SHVC to different degrees, and measure complexity in different ways, they exploit only the correlation of CU depths and modes between BL and EL. The complexity of the whole encoder therefore still has room for further reduction.
To overcome the heavy encoding computation in SHVC, we first propose a temporal-spatial searching order algorithm (TSSOA) to speed up the encoding procedure in the BL. Second, we develop a fast inter-layer searching algorithm (FILSA) to predict the CTU quadtree structure in the EL. In the BL, the TSSOA chooses five encoded temporal-spatial causal neighbouring CTUs as predictors and searches them in a priority order given by their statistically determined correlation values. Because the residual image in the EL carries less information and is highly correlated, the FILSA chooses only three encoded inter-layer causal neighbouring CTUs as predictors in the EL.

SHVC background
HEVC greatly improves coding efficiency by adopting hierarchical structures of CU, PU and TU. A CU can be split by a coding quadtree structure with four depth levels, and the CU size varies from the largest CU (LCU: 64 × 64) to the smallest CU (SCU: 8 × 8). The CTU is the largest CU. During the encoding process, each CTU block of HEVC can be split into four equally sized blocks according to inter/intra-prediction in the rate-distortion optimization (RDO) sense. At each depth level (CU size), HEVC performs motion estimation and compensation (ME/MC), transform and quantization with different block sizes. The PU is the basic unit that carries the information related to the prediction processes, and the TU can be split by a residual quadtree (RQT) with at most three depth levels, with sizes from 32 × 32 down to 4 × 4 pixels. The relationship between the hierarchical CU, PU and TU coding structures of HEVC is shown in Figure 1 [2][3][4][5][6]. In general, intra-coded CUs have only two PU partition types, 2N × 2N and N × N, whereas inter-coded CUs have eight PU types: the symmetric blocks (2N × 2N, 2N × N, N × 2N, N × N) and the asymmetric blocks (2N × nU, 2N × nD, nL × 2N, nR × 2N) [4]. When only symmetric PU blocks are used, the H.265/HEVC encoder tests seven different partition modes for an inter-slice, namely SKIP, inter 2N × 2N, inter 2N × N, inter N × 2N, inter N × N, intra 2N × 2N and intra N × N, as shown in Figure 2. The rate-distortion cost (RDcost) has to be calculated for the PUs and TUs of every partition mode in order to select the optimal mode for each CU size. Since all the PUs and available TUs of an LCU have to be searched exhaustively by the RDO process, the computational complexity of H.265/HEVC is dramatically higher than that of H.264/AVC [4,5]. This exhaustive block mode decision procedure results in high computational complexity and limits the use of HEVC encoders in real-time applications.
Since SHVC is an extension of HEVC, its coding procedure is even more complex than the already complex HEVC procedure. Based on HEVC, the SHVC scheme supports both single-loop and multi-loop solutions by enabling different inter-layer prediction (ILP) mechanisms [18,19]. A typical architecture of a two-layer SHVC encoder is shown in Figure 3, although an SHVC encoder allows one BL and more than one EL. Figure 3 illustrates how the decoded BL picture is used for prediction in EL coding. The input video of the BL is encoded and decoded with HEVC coding tools. The decoded BL picture is processed by the ILP module before being sent to the decoded picture buffer (DPB) of the EL. For the EL, the decoded BL picture obtained through ILP is called the inter-layer reference (ILR) picture. The ILP module performs inter-layer intra/residual prediction and inter-layer motion parameter prediction by upsampling. Furthermore, the discrete cosine transform/quantization (DCT/Q) and inverse DCT/inverse quantization (IDCT/IQ) modules are applied to the inter-layer prediction residues for better energy compaction. The parameters used for the EL, shown as ILP information in Figure 3, are multiplexed with the BL and EL bitstreams to form an SHVC bitstream. For spatial scalability, the high-resolution input video sequence is down-sampled to obtain the low-resolution video sequence, whereas for SNR scalability, the BL and EL use video sequences of the same resolution. Therefore, the redundancy between layers is larger for quality (SNR) scalability. Reference [18] shows that the encoding complexity of HEVC is higher than that of the H.264/AVC encoder. The computational burden of the SHVC encoder is therefore expected to be several times that of the HEVC encoder, and reducing the computational complexity of SHVC to enable real-time applications has become an important research topic.
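To make the cost of the exhaustive search concrete, the following minimal Python sketch lists the CU sizes at each quadtree depth and counts how many CUs a full recursive search visits inside one CTU. The function names are hypothetical and chosen only for illustration; the sizes (64 × 64 down to 8 × 8) are the ones stated in the text.

```python
# Illustrative sketch only: CU sizes per quadtree depth for one CTU.
CTU_SIZE = 64  # largest CU (LCU) edge length, as stated in the text
SCU_SIZE = 8   # smallest CU (SCU) edge length, as stated in the text


def cu_sizes(ctu_size=CTU_SIZE, scu_size=SCU_SIZE):
    """Return the CU edge length at each quadtree depth, starting at depth 0."""
    sizes = []
    size = ctu_size
    while size >= scu_size:
        sizes.append(size)
        size //= 2  # each split halves the edge length
    return sizes


def cus_in_full_quadtree(ctu_size=CTU_SIZE, scu_size=SCU_SIZE):
    """Count every CU visited by an exhaustive search: 4**depth CUs per depth."""
    depths = len(cu_sizes(ctu_size, scu_size))
    return sum(4 ** depth for depth in range(depths))


if __name__ == "__main__":
    print(cu_sizes())              # [64, 32, 16, 8]
    print(cus_in_full_quadtree())  # 85
```

Each of these CUs must additionally be evaluated for every candidate PU partition and TU split, which is why skipping the recursion for a CTU saves so much time.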

Proposed CTU decision method
The encoding process of each layer in SHVC is similar to that of HEVC, except that the enhancement layers also use inter-layer prediction techniques. The computational complexity of the HEVC encoder is already very high because of the recursive quadtree search for the best CTU partition, so the computational complexity of the SHVC encoder is higher still. We therefore use temporal-spatial correlation prediction in the BL, based on HEVC, and inter-layer correlation prediction in the EL to develop an efficient CTU decision method that speeds up the SHVC encoding process.

Temporal-spatial correlation in BL
As the frame rate increases, two successive frames show a stronger temporal-spatial correlation. Figure 4 shows two frames of a test sequence encoded in the BL using the low-delay (LD) configuration of the SHVC reference software (SHM 6.0) [21]. As shown in Figure 4, the quadtree structures of the CTUs in the current frame, for example Figure 4(A0) and 4(A′0), are the same as or similar to the split quadtree structures of the temporally co-located coded CTUs of the previous frame. They are also the same as or similar to the split structures of the four spatial neighbouring CTUs in the current frame, for example Figure 4(B-E). Figure 4 shows the corresponding five causal encoded neighbouring CTUs (A-E) of the current CTU(X) in the temporal-spatial direction. As these observations suggest, there is always a high correlation between encoded frames in the BL. To quantify the temporal-spatial correlation between successive frames in the BL, we performed a statistical analysis of the optimal quadtree structures of the encoded CTUs. Figure 5 shows the corresponding five causal encoded neighbouring CTUs (B_A to B_E) of the current CTU(X) in the temporal-spatial direction in the BL. Table 1 shows the probability that a temporal-spatial neighbouring CTU has the same quadtree as the current CTU in the BL, obtained with quantization parameter QP_BL = 32 and 100 frames in SHM 6.0. Table 1 shows that there is a high temporal-spatial correlation of quadtrees between two successive frames. Thus, when encoding the current frame in the BL, the quadtree of the current CTU can be predicted from the split quadtree structure of the co-located CTU in the reference frame and from the split quadtree structures of the four already encoded spatial neighbouring CTUs in the current frame.
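The kind of statistic reported in Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' measurement code: a quadtree is represented here as a nested tuple (0 for an unsplit block, a 4-tuple of sub-trees for a split block), and both function names are hypothetical.

```python
# Illustrative sketch only. Assumed quadtree encoding (not from the source):
#   0                -> block is not split
#   (t0, t1, t2, t3) -> block is split into four sub-trees


def match_probability(current_trees, neighbour_trees):
    """Fraction of CTUs whose quadtree is identical to the neighbour's quadtree."""
    if len(current_trees) != len(neighbour_trees):
        raise ValueError("both lists must describe the same CTUs")
    if not current_trees:
        return 0.0
    matches = sum(1 for cur, nb in zip(current_trees, neighbour_trees) if cur == nb)
    return matches / len(current_trees)


def rank_neighbours(probabilities):
    """Order neighbour labels by descending match probability (search priority)."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data, not measurements from the chapter.
    current = [0, (0, 0, 0, 0), 0, 0]
    colocated = [0, (0, 0, 0, 0), (0, 0, 0, 0), 0]
    print(match_probability(current, colocated))
    print(rank_neighbours({"A": 0.5, "B": 0.7, "C": 0.6}))
```

Ranking the neighbours by this probability gives a search order in which the most likely predictor is tried first.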

Inter-layer correlation between BL and EL
As described in Section 2, there is always a strong inter-layer correlation in a layer-based encoding structure. For SHVC, we can likewise expect a high inter-layer correlation between the BL and EL under the quality scalability configuration, in which the BL and ELs have the same resolution but different QPs. To measure this correlation, we statistically analysed the split quadtree structures of the encoded CTUs in the BL and EL for different video sequences. The experiment confirms a high inter-layer correlation between the BL and EL, and the results are similar to those for the temporal-spatial correlation in the BL. Figure 6 shows examples of the CTU quadtree structures in the BL and EL of the same frame. As shown in Figure 6, the quadtree structures of the CTUs in the BL, for example Figure 6(X0) and (X1), are the same as or similar to the split quadtree structures of the corresponding co-located coded CTUs in the EL, Figure 6(X′0) and (X′1). Following the same procedure as for the BL, we performed a statistical analysis of the optimal quadtree structures of the encoded CTUs in the EL and BL of the same frame. We also analysed the temporal-spatial correlation between successive frames in the EL. Table 2 shows the probability that the CTU quadtrees are the same, obtained with QP (BL, EL) = QP (32,28) in SHM 6.0. Again, we find a high inter-layer correlation between the BL and EL. Because of this high correlation, the encoded CTU quadtrees of the BL frames can be used to speed up the selection of the best predicted CTU quadtree for the corresponding EL frames [20]. In addition, the already encoded neighbouring CTUs in the EL are valuable for predicting the quadtree of the current CTU.
Therefore, the temporal-spatial neighbouring encoded CTUs in the EL and the corresponding inter-layer encoded CTU in the BL are used to predict the current CTU in the EL. Table 2 shows that, in addition to the temporal-spatial correlation within the EL, there is an even higher inter-layer correlation between the BL and EL. We also find that the probabilities for CTU(E_C), CTU(E_E) and CTU(E′_X) are almost the same and lower than the others.
For simplicity, when encoding the current frame in the EL, the current CTU(X) is first inter-layer predicted from the split quadtree structure of the CTU in the BL and then predicted from the split quadtree structures of two already encoded spatial neighbouring CTUs in the EL.

Temporal-spatial searching order algorithm (TSSOA)
To speed up the encoding process of SHVC in the BL, we propose a temporal-spatial searching order algorithm (TSSOA) that exploits the strong temporal and spatial correlation of natural video sequences. In this work, the split quadtree structures of the five causal neighbouring encoded CTUs shown in Figure 5, in the temporal-spatial direction, are first chosen as candidates for encoding the current CTU in the BL. Figure 8 shows the search priority order, which follows the sorted correlation values determined experimentally in Table 1. Block 1 represents the temporal neighbour, and blocks 2-5 denote the spatial neighbours in the horizontal, vertical, 45° and 135° diagonal directions. To determine whether a candidate split structure (one of blocks 1-5) is good enough for the current CTU, we compute the RD cost of the current CTU using the predicted split structure and compare it with a threshold (Thr). If the RD cost is less than the threshold, the candidate is good enough for the current CTU. Otherwise, the temporal-spatial correlation is low, and a full recursive process is needed to find the optimal split quadtree structure of the current CTU. The flow chart of the proposed TSSOA is shown in Figure 9. The proposed TSSOA for fast SHVC encoding can be summarized as follows:
Step 1. Set a threshold (Thr) value according to QP.
Step 2. Encode the BL of SHVC using TSSOA. If the RDcost computed by priority order 1 is less than Thr, go to step 6. Otherwise, go to step 3.
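Step 3. If it is the last neighbouring CTU, go to step 5. Otherwise, go to step 4.
Step 4. Compute the RDcost using the next neighbouring CTU in the priority order. If it is less than Thr, go to step 6. Otherwise, go to step 3.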
Step 5. Use the original RDO module to prune the best CTU quadtree of the current CTU.
Step 6. Record the best CTU quadtree and corresponding parameters of BL.
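The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the SHM implementation: `tssoa_select`, `rd_cost` and `full_rdo` are hypothetical names, the RD cost evaluation and the full recursive search are passed in as callables, and an unavailable neighbour (for example at a picture boundary) is represented by `None`.

```python
# Illustrative sketch only: early termination over prioritised candidates.


def tssoa_select(candidates, rd_cost, threshold, full_rdo):
    """Pick a quadtree for the current CTU in the BL.

    candidates -- neighbour quadtrees in search priority order (None = unavailable)
    rd_cost    -- callable returning the RD cost of coding the CTU with a quadtree
    threshold  -- QP-dependent threshold Thr
    full_rdo   -- callable running the full recursive search (fallback)
    """
    for tree in candidates:
        if tree is None:
            continue  # assumed handling: skip neighbours that do not exist
        if rd_cost(tree) < threshold:
            return tree, "predicted"  # candidate is good enough, stop early
    # No candidate passed the threshold: correlation is low, search exhaustively.
    return full_rdo(), "full_rdo"


if __name__ == "__main__":
    # Hypothetical toy costs keyed by a quadtree label.
    toy_costs = {"temporal": 500.0, "left": 90.0}
    result = tssoa_select(
        ["temporal", None, "left"],
        rd_cost=toy_costs.__getitem__,
        threshold=100.0,
        full_rdo=lambda: "exhaustive",
    )
    print(result)
```

Passing the cost function and the fallback as callables keeps the sketch independent of any particular encoder; in a real encoder both would be provided by the RDO module.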

Fast inter-layer searching algorithm (FILSA)
For fast EL encoding, we use the fast inter-layer searching algorithm (FILSA) between the BL and EL to predict the split quadtree structure of the current CTU in the EL. Because the residual image in the EL carries less information and is very highly correlated, only the split quadtree structures of the three causal neighbouring CTUs shown in Figure 10 are chosen as candidates. These three, CTU(B_A), CTU(E_B) and CTU(E_D), show the highest correlation with the current CTU. In other words, we eliminate CTU(E_C), CTU(E_E) and CTU(E′_X) as candidates, since their probabilities are almost the same and lower than the others. Therefore, when encoding the current frame in the EL with FILSA, the current CTU(X) is first inter-layer predicted from the split quadtree structure of the CTU in the BL and then predicted from the split quadtree structures of the two already encoded spatial neighbouring CTUs in the EL. To determine which candidate is best for the current CTU in the EL, the FILSA computes the RD cost of each predicted split quadtree and selects the one with the minimum RD cost as the best split quadtree of CTU(E_X). Our experiments verify that the loss in encoding performance is negligible when only the three candidates shown in Figure 10 are used.
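The minimum-cost selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `filsa_select` and `rd_cost` are hypothetical names, the candidate labels mirror the CTU names used in the text, and a missing neighbour is represented by `None`.

```python
# Illustrative sketch only: choose the candidate quadtree with the lowest RD cost.


def filsa_select(candidates, rd_cost):
    """Return (label, quadtree) of the cheapest available candidate.

    candidates -- dict mapping a candidate label to its quadtree (None = unavailable)
    rd_cost    -- callable returning the RD cost of coding the CTU with a quadtree
    """
    available = [(label, tree) for label, tree in candidates.items() if tree is not None]
    if not available:
        raise ValueError("no candidate quadtree is available")
    # min() keeps the first candidate on ties, so dict order acts as a tie-breaker.
    return min(available, key=lambda item: rd_cost(item[1]))


if __name__ == "__main__":
    # Hypothetical toy quadtrees and costs: 0 = unsplit, 4-tuple = split once.
    toy_costs = {0: 120.0, (0, 0, 0, 0): 95.0}
    chosen = filsa_select(
        {"B_A": 0, "E_B": (0, 0, 0, 0), "E_D": 0},
        rd_cost=toy_costs.__getitem__,
    )
    print(chosen)
```

Unlike the threshold test used in the BL, this selection always evaluates every available candidate, which is affordable because only three candidates are considered.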

Fast SHVC encoder
Based on the proposed TSSOA and FILSA in BL and EL encoding procedure, respectively, we can develop a fast SHVC encoder using efficient CTU decisions. First, we utilize the TSSOA to speed up the encoding procedure in BL. Second, we employ the FILSA to predict the CTU quadtree structure in ELs. Therefore, we can implement an early termination (ET) for split quadtree search using an efficient CTU decision method based on combining the proposed TSSOA and FILSA. The proposed SHVC encoder does not need to go through all the modes, thus significantly reducing the computational complexity. The flow chart of the proposed fast SHVC encoder is shown in Figure 11.

Simulation results and discussion
For the performance evaluation, we compare the total execution time of the proposed method with that of SHM 6.0 [21] to confirm the reduction in computational complexity. The system hardware is an Intel(R) Core(TM) i5-3350P CPU @ 3.10 GHz with 8.0 GB of memory, running a Windows XP 64-bit operating system. Additional details of the encoding environment are described in Table 3. The performance of the proposed complexity reduction method is compared with that of the unmodified SHVC encoder in terms of encoding time, impact on bitrate and peak signal-to-noise ratio of the Y component (PSNRY). Note that for each video sequence, the reported encoding time is the total time (BL + EL). The coding performance is evaluated in terms of ΔBitrate, ΔPSNRY and the time improving ratio (TIR), which are defined in Eqs. (1)-(3). Here, ΔBitrate is the ratio of encoding bitrate reduction, and Bitrate_proposed and Bitrate_SHM6.0 are the encoding bitrates of the proposed method and of the conventional method based on the SHM 6.0 reference software, respectively.

Similarly, ΔPSNRY is the ratio of encoding quality reduction, where PSNRY_proposed and PSNRY_SHM6.0 are the PSNRY values of the proposed method and of SHM 6.0, respectively, and TIR is the ratio of encoding time reduction, where TIME_proposed and TIME_SHM6.0 are the encoding times of the proposed method and of SHM 6.0, respectively. Encoding time is commonly used to measure the computational complexity of the SHVC encoder, so the TIR is adopted to assess the proposed fast method.
The threshold (Thr) for TSSOA is an important parameter in BL encoding because it affects the coding performance of the proposed algorithm. A lower Thr means that more RDOs are performed to prune the best CTU quadtree, so more time is spent on encoding and the quality is closer to that of SHM 6.0. However, since the proposed fast algorithm targets a real-time implementation of the SHVC encoder, we focus on the improvement in encoding time. We conducted several experiments with different values of Thr to study its effect on the resulting TIR for the test sequences shown in Table 3. Figure 12 shows the average curve of TIR vs. Thr_QP for QP_BL = 32, which indicates that the TIR is approximately the same for all Thr ≥ 350,000. Our experimental results show that the resulting curves depend strongly on QP. Since different QP_BL values yield different average curves of TIR vs. Thr_QP, the thresholds are expected to be QP-dependent. Furthermore, our extensive experiments show a linear relationship between the threshold values and the QP_BL values. To model this relationship mathematically, a linear regression model, which essentially performs polynomial fitting to approximate a linear function, is used to derive the formula in Eq. (4) [20], where λ = 0.4845 × 2^((QP−12)/3) is defined in the SHVC specification [5]. Tables 4-7 tabulate the performance of SHM 6.0 and of the proposed method with different quantization parameter pairs under the random access (RA) and LD scenarios, separately. The simulation results show that the proposed algorithm reduces the computational complexity of CTU quadtree pruning in SHVC by about 34–71% compared with SHM 6.0. From Tables 4-7, we find that the proposed fast SHVC encoder achieves an average TIR of about 47–78%.
In addition, we observe that the encoding time improvement is larger when the QP pair values increase. This is because the quantization error becomes so large that the temporal-spatial and inter-layer correlations are lower. Furthermore, Tables 4-7 show that the TIR of the CU module is higher for the Kimono and BasketballDrive sequences across the tested methods and QP values. This is because the backgrounds of these two sequences change slowly and the movements are rather homogeneous.
Table 7. Comparison of the proposed method with SHM 6.0 using QP (40,36).
In summary, the results show the superiority of the proposed efficient CTU decision method, which combines TSSOA and FILSA, over the state-of-the-art unmodified SHVC method.
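The three measures can be computed as in the sketch below. The exact forms are assumptions made for illustration: ΔBitrate and TIR are taken as relative differences with respect to SHM 6.0, expressed in percent, and ΔPSNRY is taken as a plain difference in dB. All function names are hypothetical.

```python
# Illustrative sketch only: assumed conventional forms of the three measures.


def delta_bitrate(bitrate_proposed, bitrate_shm):
    """Assumed form: relative bitrate change versus SHM 6.0, in percent."""
    return (bitrate_proposed - bitrate_shm) / bitrate_shm * 100.0


def delta_psnr_y(psnr_y_proposed, psnr_y_shm):
    """Assumed form: luma PSNR difference versus SHM 6.0, in dB."""
    return psnr_y_proposed - psnr_y_shm


def time_improving_ratio(time_shm, time_proposed):
    """Assumed form: encoding time saved relative to SHM 6.0, in percent."""
    return (time_shm - time_proposed) / time_shm * 100.0


if __name__ == "__main__":
    # Hypothetical numbers, not results from the chapter.
    print(delta_bitrate(1010.0, 1000.0))
    print(delta_psnr_y(37.5, 38.0))
    print(time_improving_ratio(100.0, 40.0))
```

With these sign conventions, a positive TIR means the proposed encoder is faster, a positive ΔBitrate means it spends more bits, and a negative ΔPSNRY means a quality loss.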
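As a small numerical aid, the sketch below evaluates the multiplier λ = 0.4845 × 2^((QP−12)/3) quoted in the text and shows an ordinary least-squares line fit of the kind a linear regression model performs. It does not reproduce Eq. (4), whose coefficients are not restated here; the function names are hypothetical and any data passed to `fit_line` would have to come from measurements such as those behind Figure 12.

```python
# Illustrative sketch only: lambda from the text plus a generic line fit.


def lagrange_lambda(qp):
    """Evaluate lambda = 0.4845 * 2 ** ((QP - 12) / 3) as quoted in the text."""
    return 0.4845 * 2.0 ** ((qp - 12) / 3.0)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept (pure Python)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    for qp in (28, 32, 36, 40):
        print(qp, lagrange_lambda(qp))
    # Hypothetical points lying on y = 2x + 1, used only to exercise the fit.
    print(fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))
```

Because λ doubles every three QP steps, a threshold tied to λ grows quickly with QP, which is consistent with the thresholds being QP-dependent.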

Conclusions
In this chapter, we proposed a fast encoding method that uses temporal-spatial correlation and inter-layer correlation to reduce the encoding complexity of quality SHVC. In our scheme, the split quadtree information of the BL is used to predict the split CTU quadtree in the ELs, which avoids redundant computations. Performance evaluations show that our approach significantly reduces SHVC coding complexity (up to 77.74%, on average) with minimal impact on the overall bitrate.