Video encoding formats .
Before broadcasting a video signal, redundant data should be removed from the transmitted video signal. This redundancy operation can be performed using many video coding standards such as H.264/Advanced Video Coding (AVC) and H.265/High-Efficient Video Coding (HEVC) standards. Although both standards produce a great video resolution, too much data are considered to be still redundant. The most exhaustive process in video encoding process is the Motion Estimation (ME) process. The more the resolution of the transmitted video signal, the more the video data to be fetched from the main memory. This will increase the required memory access time for performing the Motion Estimation process. In This chapter, a smart ME coprocessor architecture, which greatly reduces the memory access time, is presented. Data reuse algorithm is used to minimize the memory access time. The discussed coprocessor effectively reuses the data of the search area to minimize the overall memory access time (I/O memory bandwidth) while fully using all resources and hardware. This would speed up the video broadcasting process. For a search range of 32 × 32 and block size of 16 × 16, the architecture can perform Motion Estimation for 30 fps of HDTV video and easily outperforms many fast full-search architectures.
- HDTV broadcasting
- motion estimation
- video coding
- video transmission
Broadcasting is the distribution of audio or video content to a dispersed
H.264/AVC  is a video coding standard that was developed by ITU and ISO . The main advantage of H.264/AVC standard is to minimize the bit rate of the transmitted video signal. It achieves up to 50% savings in the transmitted video bit rate if compared to other previous standards (please see Figure 1). As a result, H.264/AVC standard is used in many video applications such as HD-DVD, multimedia streaming, remote video surveillance, medical image processing field, video conferencing, HDTV broadcasting, video on demand, and multimedia messaging [5, 6, 11]. Multiple reference frames, half-pel and quarter-pel accurate Motion Estimation, using small block size, exact-match transform, adaptive in-loop deblocking filter, and enhanced entropy coding methods, and variable block size techniques are used to achieve low bit rate and high resolution of the transmitted video signal .
Due to the high demand of high-resolution video applications and the traffic constrains of the network infrastructure, the offered transmission bit rate of H.264/AVC standard is not suitable to fulfill such application needs. ITU-T VCEG and ISO/IEC MPEG  developed a new video coding standard called H.265/High-Efficient Video Coding (HEVC) standard. H.265/HEVC standard was developed for three main goals. The first goal is to be able to encode high-resolution video sequences such as 4 K and ultrahigh definition (UHD). The second goal is to lower the video transmission bit rate by approximately 50% compared to the H.264/AVC standard. The last goal is to speed up the video coding and transmission process by utilizing parallel processing operations. It is worth mentioning that H.265/HEVC encoder is four times more complex than older standards [1, 14]. However, from the previous aforementioned discussions, it is clear that both H.264/AVC and H.265/HEVC standards require processing too much data in order to obtain higher compression and accordingly low bit rate. This is why the implementation of both standards is not that easy because of the high demand of huge memory size that is being able to process such huge data. This is the main bottleneck in such video coding standards, since fetching such huge data from memory will increase the memory access time and slow down the encoding process. As a result, those video coding standards may not be suitable for real-time video applications such as HDTV broadcasting.
The huge data and memory access time that are needed for HDTV broadcasting, if using either H.264/AVC or H.265/HEVC standards, are the recent problems under study for many researchers nowadays. Some techniques tried to reduce the search area size in the Motion Estimation process in order to reduce the memory access time. Adaptive search window size (ASWS) technique [15, 16] adaptively decides the size of the search area according to the motion activity of the Current Block. The authors in  used three window sizes 3 × 3, 7 × 7, and 15 × 15. The main drawback of such algorithm is the lake of video visual quality since the selected window sizes are not enough to cover fast motion activities of certain Current Blocks. In , this problem is avoided by adaptively adjusting the search window size according to some model equations that used some parameters. Those parameters changed according the motion activity of a Current Block and considering the motion activity of surrounding blocks. The accuracy to decide the search window size in  is very high, and, accordingly, the memory access time is less than the case of using the conventional full-search Motion Estimation (FSME) process. Although the previous techniques greatly reduce the memory access time, their very-large-scale integration (VLSI) implementation is not easy . They are not suitable for hardware implementation as they lose the regularity of data flow, but they can greatly reduce the computational complexity. Some straightforward VLSI architectures are used to implement the FSME of the video encoder (either for H.264/AVC or H.265/HEVC standards). Such implementation has many advantages such as they are greatly reducing the memory access time. Additionally, the data flow of their architecture is uniform. This gives such architectures simplicity in their design. The architecture given in  is a good example for such architectures. The authors in  used a smart algorithm with a simple local memory to reuse data (data reuse level A and level B) of the search area. It means that data could be fetched only once from the main memory. This greatly decreases the memory access time required for the Motion Estimation process. The design proposed in  is a flexible one, at which it can be used either for H.264/AVC or H.265/HEVC standards with slightly modifications in the hardware.
The chapter is organized as follows. Section 2 illustrates the problem formulation in details. Section 3 describes the memory I/O bandwidth reduction techniques. Motion Estimation co-processor with data reuse is described in details in Section 4.
2. Problem formulation
The most exhaustive part in video encoding process is the Motion Estimation process . As seen in Figure 2, Motion Estimation process consumes up to 53 and 70% of the entire encoding process in case of using one and four reference frames, respectively. This is due to the huge data and computations that are required to perform such operation. As a result, we emphasize on reducing both the computations and the data fetched from the main memory in order to speed up the video encoding process in order to be available for real-time HDTV broadcasting application. In Motion Estimation process, the current frame ψ should be divided into blocks called Current Blocks (please see Figure 3). The size of each block is M × M pixels. The main idea of the Motion Estimation process is to reduce the temporal redundancy in the transmitted video sequences. This can be done by searching for a best match candidate block of each Current Block within a search area in the reference frame ψ − 1 as seen in Figure 3. Referring to Figure 3, the best match candidate block can be calculated by exhaustively searching all candidate blocks allocated in every pixel in the search area. The search area is allocated at frame ψ − 1. The search area size is 2Xmax × 2Xmax, where 2Xmax is the range of the selected search area. The sum of absolute difference (SAD) metric is used to search for the best match candidate block. The pixel in the search area corresponding to the minimum SAD value (i.e., point (k,l) in Figure 3) represents the best match candidate block. Two outputs of the Motion Estimation process are seen in Figure 3. The first output is the residue between the Current Block and the best match candidate block. The second output is the displacement between the center of the search area and the best match candidate block, represented by the motion vector (MV). The MV can be represented as
where −Xmax ≤ k,l ≤ Xmax, x, and y are the center coordinates of the search area and k and l are the coordinates of the best match candidate block.
It is worth mentioning that video applications consume much data compared to some other multimedia sources such as image, speech, and text. The number of bits required to transmit video sequences depends on the video frame resolution. The lager the frame resolution is, the larger the number of bits required to be transmitted. Consequently, the transmission bit rate will also increase. Different video frame formats are shown in Table 1. Given that the number of bits/pixels required for specific video resolution is B, the number of pixel per line of one video frame is P, the number of lines per video frame is L, and the transmitted frame per second for such video resolution is F, the video transmission bit rate (BR) can be calculated as in Eq. (2):
|Frame size (pixels)||Frame rate (frames/sec)|
|HDTV broadcast||1920 × 1080||30|
|SDTV broadcast (D1)||720 × 486||30|
|Digital cinema (DC)||4096 × 2160||24 fps|
|Standard definition (SD)||720 × 486||30 fps|
|Video conferencing (SIF)||352 × 240||30|
|Internet streaming video|
(QSIF to SIF)
|176 × 144|
352 × 240
|Desktop video phone (QSIF)||176 × 144||15–30|
It is noticed from Table 1 that frame size is increased due to the increase of consumer demand for higher resolution . The more the video frame size is, the more the required search area size will be. This increases the data to be fetched from the memory. Accordingly, more memory I/O bandwidth is required. This is a big problem since the memory I/O bandwidth is limited. As an example, HDTV broadcasting requires much data to be fetched from the memory than the Internet streaming video which uses QSIF or SIF video formats. Many recent researches are directed to create different techniques for a better use of the available memory I/O bandwidth. In the following section, a review of most recent techniques that have been used for reducing the I/O bandwidth problem will be performed.
3. Memory I/O bandwidth reduction
The memory I/O bandwidth is the main problem in video encoding process. The more the data is needed from the memory, the more the required memory I/O bandwidth is. As a result, the delay of the video encoding process will be high. This will prevent the use of such video encoding process in real-time applications such as HDTV and SDTV broadcasting. To avoid such problem, the search area data required for the Motion Estimation process in a video encoder should be reused. In other words, the required data for the Motion Estimation process should be fetched only once from the main memory. This will speed up the video encoding process and allow the video encoder to be used in real-time video applications.
During the full-search Motion Estimation, the current frame is divided into nonoverlapped blocks. Each block (Current Block) has a size of M × M pixels. The selected search area has a size of SAx × SAy. The size of the Current Block and the search area is different according to the frame size. As an example, the Current Block and search area sizes for SIF video sequences are 16 × 16 and 32 × 32, while for SDTV and HDTV are 32 × 32 and 64 × 64, respectively. The search area can be divided into candidate blocks or search pixels. It is worth mentioning that the candidate blocks for the contour pixels use additional padding (please see dashed line area in Figure 4). The padding pixels will have maximum values to be sure that they will be excluded from our selections.
Considering the candidate blocks #1 and #2 in Figure 4, there will be M − 1 overlapped pixels that should be fetched twice from the main memory. If we continue horizontally, there will be a complete strip that contains several overlapped areas between the candidate blocks of the search area. The horizontal strip of overlapped Data Pixels can be reused and fetched only once from the main memory. In this case, such data is called data reuse level A. If the Motion Estimation process is proceeded by going one step down, huge Data Pixels will be overlapped. This is called data reuse level B, if fetched only once from the memory. It is worth mentioning that a smart architecture of the Motion Estimation processor is required in order to be able to reuse data and reduce the I/O memory access bandwidth. Given that the size of the Current Block is M × M, the number of Data Pixels (DP) required from the memory to perform the Motion Estimation process for only one Current Block is given by Eq. (3). There are much redundant (repeated) data that are fetched from the main memory. This increases the I/O memory bandwidth which is not practical for real-time applications such as HDTV broadcasting:
Data reuse is the most recent efficient algorithm that is used to reduce the I/O memory bandwidth while performing Motion Estimation process [4, 19, 20]. Many redundant memory access times are skipped during the Motion Estimation process while performing the data reuse algorithm. The algorithm skips the calling of the same pixels multiple times and accesses the required Data Pixels only once while performing the Motion Estimation process. This is how it reduces accessing the main memory many times, and, consequently, it reduces the I/O memory bandwidth. For a maximum reduction of the I/O memory bandwidth, there are four levels of data reuse: levels A, B, C, and D [4, 19]. The four levels are described in the following subsections.
3.1. Data reuse level A
For a horizontal strip as seen in Figure 4, there will be an overlapped Data Pixels between any two consecutive blocks. The overlapped area size is . Data reuse level A principle is achieved by loading only new data from the memory and ignoring old data that are already fetched before from the main memory [4, 19]. This requires a smart local memory in the Motion Estimation processor to keep reusing the old fetched data. As an example, seen in Figure 4, when performing the SAD operation between the Current Block and candidate block #2, only M column of Data Pixels will be fetched from the memory. This is a huge saving in the memory I/O bandwidth.
3.2. Data reuse level B
As seen in Figure 4, there is another vertical level of overlapped data if the SAD operation is performed one pixel in the down direction [4, 19]. The size of the overlapped Data Pixels is . The overlapped Data Pixels will be while performing the SAD operation between the Current Block and both of candidate blocks strip #1 and #2 as seen in Figure 4. Data reuse level B is achieved by loading only SAx Data Pixels when moving from one horizontal strip to the next strip below it.
3.3. Data reuse level C
For two horizontal consecutive search areas SA1x × SAy and SA2x × SAy, there will be an overlapped area of a size as seen in Figure 5. These overlapped Data Pixels could be fetched only once if level C data reuse is applied while performing Motion Estimation process [4, 19].
3.4. Data reuse level D
For two vertical consecutive search areas SAx × SA1y and SAx × SA2y, there will be an overlapped area of a size as seen in Figure 6. These overlapped Data Pixels could be fetched only once if level D data reuse is applied while performing Motion Estimation process . Performing data reuse levels C and D require higher complexity in the design of the Motion Estimation process to reuse data already loaded from the former search area strips to the latter search area strips. This is a trade-off between the higher design complexity and the reductions in the memory I/O bandwidth.
4. Motion estimation co-processor with data reuse
The main idea that is discussed in this section is how to design a smart Motion Estimation co-processor that could perform data reuse for a maximum reduction in the memory I/O bandwidth. So, such co-processor can be used for real-time video applications such as HDTV broadcasting. The design of the co-processor is preferred to be simple. This requires implementing only level A and level B data reuse since the other two levels require too much hardware complexity. One smart design that is modified and discussed in this chapter is the one proposed in . The modified Motion Estimation co-processor can be used for both H.264/Advanced Video Coding (AVC) and H.265/High-Efficient Video Coding (HEVC) standards. Although the used window size for such co-processor is 32 × 32 pixels with a Current Block size of 16 × 16 pixels, it can be generalized for a higher number of pixels in both the search area and the Current Block to be suitable for HD video sequences. A modified top-level architecture of the proposed Motion Estimation co-processor in  is shown in Figure 7.
This co-processor is designed for H.264/AVC; however, it could be used for H.265/HEVC standard with a little modification in the hardware. The structure in  is designed to perform the Motion Estimation process for a Current Block of size 16 × 16. The search area can be 32 × 32 or more. To increase the speed of the co-processor proposed in , the Current Block is directly loaded from the main memory, and the search area will be loaded into a local memory as will be discussed later. The structure consists of a processing element (PE) array. The PE array consists of a 16 × 16 PE that computes the absolute difference (AD) values between the Current Block and a candidate block in the search area. The Current Block will be directly loaded into the PE array from the main memory, while the candidate block will be loaded into the PE array from the local memory. Data reuse levels A and level B are implemented using a local memory. This local memory is very simple and consists only of bank of registers. Using smart control signals, this local memory can succeed to greatly reduce the memory I/O bandwidth. Once the ADs are calculated using the PE array, an adder tree is used to add all of them and calculate a final SAD. The adder tree is adding all values in parallel fashion so that the addition process will be very fast. The SAD will be compared with the calculated SAD so far, and the minimum value will be selected. The candidate block in the search area with the minimum SAD so far will be selected to be the best match candidate block and the corresponding motion vector. The demultiplexer is used to smartly feed the candidate blocks into the PE array.
4.1. Design the local memory
Implementing data reuse levels A, B, C, and D can be performed using a local memory inside the Motion Estimation co-processor. This local memory could fetch the search areas only once from the main memory. This means less memory access time is required. A smart algorithm is needed to control the data flow in/out of the local memory. The local memory in Figures 7 and 8 is designed to perform level A and level B data reuse. However, it can be used with slight modifications in its hardware and the operating algorithm to perform the other two data reuse levels (i.e., level C and level D data reuse). The modified architecture of the local memory  is shown in Figure 8. As mentioned before, this co-processor is mainly used for the Motion Estimation process of the H.264/AVC encoder.
The Current Block (CB) is loaded only once per one search area directly from the main memory into the PE array. The 16 × 16 Current Block is loaded into the PE array row by row as seen in Figure 9. The CB loading process starts from the bottom PE row and shifted in the up direction as seen in Figure 9. Each pixel of the CB is loaded into a register Rx of each PE via pin Xin as seen in Figure 8. While loading a new row of the CB into the PE array, the content of the register RX in each PE will be shifted up via the pin Xout. This process will be continued until all the CB is loaded into the PE array. Loading the CB into the PE array consumes 16 clock cycles. The local memory is used to load the PE array with the search area of size SAx × SAy (including the padding area). The local memory is very simple in its design. It consists of two banks of registers. Each bank is 16 × 16 byte registers. No addressing modes are required for such local memory. Only a simple 5-bit counter is used for addressing. There is an additional (1 × 2) demultiplexer (Demux) used to control loading the PE array by the search area.
The Motion Estimation process starts by loading the Current Block inside the PE array as mentioned before. The local memory starts loading the search area by loading the top left 16 × 16 candidate block into Register Bank #1 of the local memory as seen in Figure 8. This will be performed via connecting terminal 1 of the Demux to the main memory. Loading operation will be performed by loading the candidate block of the search area line by line into Register Bank #1. This will take additional 16 clock cycles.
In clock cycle 33, two operations will be performed at the same time. First, since the first top-left candidate block is available on Register Bank #1 of the local memory, the Motion Estimation co-processor will start loading this candidate block into the PE array column by column starting by the left column of the Register Bank #1 of the local memory. A simple 5-bit counter will be used to read such columns. The pixels of the candidate block will be loaded into the PE via pin Yin and stored into register RY (please see Figure 8). It is worth mentioning that every time a column of a candidate block is loaded into the PE array, it will be shifted to the left direction via pin Yout (as seen in Figures 8 and 9) until the PE array is filled with the candidate block. Second, the second group of the search area will be loaded into the Register Bank #2 of the local memory. This will be performed by connecting terminal 2 of the Demux to the main memory.
Once the candidate block is loaded into the PE array, the PE array will start calculating the AD between each pixel in the Current Block and the corresponding pixel in the candidate block using the adder of each PE. It is worth mentioning that level A data reuse will be performed by shifting only one column from the Register Bank #2 of the local memory into the PE array. While loading the first strip candidate blocks inside the PE array using the Register Bank #2 of the local memory, level B data reuse could be performed by loading new horizontal line into the left bank registers of local memory (Register Bank #1) via terminal 1 of the demultiplexer. Once the last candidate block of the first strip is loaded into the PE array, the counter will back to load a new candidate block (using level B data reuse) by loading the first left column of the Register Bank #1 of the local memory into the PE array. This operation will be continued until all candidate blocks are loaded into the PE array.
Every time a new candidate block is loaded into the PE array, 256 ADs will be calculated by the PEs. The adder tree will calculate the SAD value of the 256 ADs. The comparator will compare the current SAD value with the minimum SAD so far. The final minimum SAD will be decided after all candidate blocks are checked. All of these operations are performed at no stall at all and 100% utilization of the PE array. This is the main advantage of the architecture discussed in this section; this is why such architecture is suggested for real-time high-speed HDTV broadcasting applications. The final output of the co-processor will be the best candidate block with the minimum SAD value and the corresponding MV. These two values will be sent to the main processor to calculate the residue as seen in Figures 3 and 7.
This book chapter provides a solution for the huge memory access time required by the high-resolution video broadcasting operation (HDTV broadcasting). A high-speed ME architecture that greatly minimizes the memory access time, if adopted into high-resolution H.264/AVC and H.265/HEVC video standards, has been presented. The proposed architecture can perform ME process for 30 fps of HDTV video. It can easily outperforms over many fast full-search architectures due to the great reduction in the memory access time required for the ME process. The proposed architecture effectively uses pipelining, parallelism, and data reuse to achieve a high-throughput rate while maintaining 100% PE utilization. The proposed co-processor is suitable for high-performance and high-accuracy real-time video applications such as HDTV and SDTV broadcastings. As a future work, the author of the book chapter is welling to reduce the processing time required for the ME process. This can be achieved by adopting some dynamic models inside this co-processor to adaptively select the proper size of the search area. The size of the search area is decided using these model equations according to the motion activity of the Current Block. Accordingly, the computations and the memory access time required for the ME process are expected to be greatly reduced.
The author acknowledges the support of Southern University and A&M College, Baton Rouge, USA, for its support to finalize this work.