Open access peer-reviewed chapter - ONLINE FIRST

Performance Analysis of OpenCL and CUDA Programming Models for the High Efficiency Video Coding

By Randa Khemiri, Soulef Bouaafia, Asma Bahba, Maha Nasr and Fatma Ezahra Sayadi

Submitted: May 7th 2021 | Reviewed: August 6th 2021 | Published: October 19th 2021

DOI: 10.5772/intechopen.99823



In motion estimation (ME), block-matching algorithms have great potential for parallelism. The search for the best match is performed by computing, for each block position inside the search area, a similarity metric such as the Sum of Absolute Differences (SAD), which is used in the various steps of motion estimation algorithms. Since the computation is identical for every block of pixels, it can be parallelized on a Graphics Processing Unit (GPU), offering better results. In this work, a fixed OpenCL code was first run on several architectures (CPU and GPU); then a parallel GPU implementation of the SAD process was developed with both CUDA and OpenCL, for block sizes from 4x4 to 64x64. A comparative study of the GPU execution times was carried out on the same video sequence. The experimental results indicate that the OpenCL execution times on the GPU were better than the CUDA times, with a performance ratio approaching a factor of two.


  • HEVC
  • ME
  • SAD
  • GPU
  • CUDA
  • OpenCL

1. Introduction

The Graphics Processing Unit (GPU) [1] is a microprocessor present on graphics cards and game consoles. It has a highly parallel architecture initially dedicated to accelerating graphics tasks. General-Purpose computation on GPUs (GPGPU) languages such as the Compute Unified Device Architecture (CUDA) [2] and the Open Computing Language (OpenCL) [3] have enabled application development in many domains.

CUDA is an NVIDIA Corporation programming model that runs only on NVIDIA GPUs. OpenCL, an effort of the Khronos Group, is very close to CUDA; however, it is an open standard for parallel programming across various platforms: CPUs, GPUs, Digital Signal Processors (DSPs), and other types of processors. OpenCL is therefore able to manage several devices; the concept of a context, which designates a set of devices, makes this possible.

However, there are two major differences. The first is that OpenCL codes are much longer than CUDA C codes, which is explained by the multi-platform nature of OpenCL. The second is that the kernel is built from the host code at runtime using the OpenCL runtime library [4]. The OpenCL kernel can be launched in two ways: by explicitly defining both the global and local sizes of the work-groups, or by leaving OpenCL to select the local size itself. The size of a work-group corresponds to a CUDA thread block; this index-space configuration is also known as the NDRange, as seen in Figure 1.

Figure 1.

Model of software programming.

The two languages provide a similar hierarchical decomposition of the computation index space, as explained in Table 1. Synchronization is available only at the thread block / work-group level.

CUDA            OpenCL
Thread block    Work-group
Thread          Work-item
Thread ID       Global ID
Block index     Block ID

Table 1.

Execution model terminology mapping.
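The mapping in Table 1 can be illustrated with a short serial Python sketch (the sizes used are hypothetical, chosen only for illustration) that decomposes a 1-D index space the way both models do: a global ID is recovered from the group/block index, the local size, and the local ID.

```python
# Serial illustration of the CUDA/OpenCL index-space decomposition.
# local_size corresponds to a CUDA thread block / OpenCL work-group size.

def global_ids(global_size, local_size):
    """Return (group_id, local_id, global_id) triples for a 1-D NDRange."""
    assert global_size % local_size == 0  # OpenCL 1.x requires an even split
    ids = []
    for group_id in range(global_size // local_size):   # block index / group ID
        for local_id in range(local_size):              # thread / work-item ID
            global_id = group_id * local_size + local_id
            ids.append((group_id, local_id, global_id))
    return ids

# Example: 8 work-items in groups of 4 gives two groups of four items.
mapping = global_ids(8, 4)
```

The same decomposition generalizes to 2-D and 3-D index spaces, which is what the SAD kernel below uses for image blocks.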

This paper proposes an implementation of the Sum of Absolute Differences (SAD) of the High Efficiency Video Coding (HEVC) Motion Estimation (ME) algorithm on an NVIDIA GPU, using the CUDA and OpenCL languages to compare their performance.

This manuscript is structured as follows: Section 2 introduces the HEVC ME feature. Section 3 presents the HEVC SAD algorithm. Section 4 describes the proposed SAD kernel. Section 5 gives the experimental results and discussion. Finally, Section 6 concludes this paper.


2. HEVC ME feature

The key element of HEVC is ME, which represents the most time-consuming task in video coding. Indeed, the complexity of ME increases significantly with the coding block size [5]. Inter-prediction carries a heavy complexity burden of up to 80% [6] of the total encoding process, due to ME, which consumes around 70% of the inter-prediction time, as shown in Figure 2 [6].

Figure 2.

HEVC inter-prediction time distribution [6].

ME is performed on a block-by-block basis and supports variable block sizes in HEVC. The coding tree unit (CTU) structure, which offers a compromise between good quality and a lower bit-rate, is based on three new concepts: coding unit (CU), prediction unit (PU), and transform unit (TU) [7, 8].

Each picture is divided into CTUs of 64 × 64 pixels, which can then be recursively partitioned into four CUs [9] sized from 8 × 8 to 64 × 64 pixels. Each CU contains one or several PUs and TUs.
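As a rough illustration of this recursive partitioning, the sketch below enumerates the CU sizes reachable from a 64 × 64 CTU down to the 8 × 8 minimum. It is a simplification: in a real HEVC encoder, each split is decided by rate-distortion cost, which is not modeled here.

```python
def cu_sizes(ctu_size=64, min_cu=8):
    """Collect the CU side lengths produced by recursive quad-splits
    of a CTU (each CU may be split into 4 half-size CUs)."""
    sizes = set()

    def split(size):
        sizes.add(size)
        if size > min_cu:
            split(size // 2)  # one of the four quadrants; all have this size

    split(ctu_size)
    return sorted(sizes)
```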

In the HEVC ME algorithm, the SAD and the Sum of Squared Differences (SSD) are the most frequently used functions. These cost functions are used to decide the best coding mode and its associated parameters. The SAD is presented in the next section.


3. HEVC SAD algorithm

The calculation of the Sum of Absolute Differences (SAD) is commonly used for motion estimation in video coding and is usually the computationally intensive part of video processing [10, 11]. For a motion-compensation block of size N × N, it computes the difference between the pixel intensities of the current-frame and reference-frame blocks, where Current(i,j) and Reference(i,j) denote the current and reference frame block pixels [12]:

SAD = Σ(i=0..N−1) Σ(j=0..N−1) |Current(i,j) − Reference(i,j)|   (E1)

SAD is also used as an error measure to identify the most similar block and to evaluate the motion vector in the motion estimation phase [13]. SAD is a simple and fast evaluation metric that takes every pixel of a block into account, which makes it very efficient for many motion estimation algorithms (Figure 3).

Figure 3.

Block matching algorithm based on SAD.
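A serial Python sketch of the metric in Eq. (E1) follows; the GPU versions discussed in the next section compute the same sum, with one absolute difference handled per thread/work-item.

```python
def sad(current, reference):
    """Sum of Absolute Differences between two equally sized pixel blocks,
    given as lists of rows."""
    assert len(current) == len(reference)
    total = 0
    for cur_row, ref_row in zip(current, reference):
        for c, r in zip(cur_row, ref_row):
            total += abs(c - r)
    return total

# Identical blocks give SAD = 0; block matching keeps the candidate
# position in the search area that minimises the SAD.
block_a = [[10, 20], [30, 40]]
block_b = [[12, 18], [30, 44]]
# sad(block_a, block_b) == |10-12| + |20-18| + |30-30| + |40-44| == 8
```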


4. Proposed SAD kernel

The calculation of the SAD can be parallelized on the GPU, since each pixel is treated separately; this matches the architecture of graphics processors, where a 2D grid of thread blocks computes all disparities for the 2D blocks of the image. Each thread computes the SAD value for one block position in the search range, and a thread block calculates the entire SAD for an image block. The benefit is that all SADs for an image block are calculated in the same thread block.

In [14], the authors implemented the SAD on a general-purpose GPU architecture. A significant speed-up of 204× over the serial implementation was obtained for an image size of 1024 × 768 on the GeForce GTX 280, as shown in Figure 4.

Figure 4.

Typical mapping of a block-matching algorithm to a GPU.

The SAD kernel is composed of two main steps: the subtraction of the PU pixels, then the summation. The addition is achieved on the GPU with a parallel reduction. In step 1, the first N/2 elements are added to the other N/2. As a result, in step 2, there are N/2 elements left to add, and again the first half is added to the second half. The same step is repeated until only one number remains, as shown in Figure 5 [15].

Figure 5.

Reduction technique.
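The pairwise reduction of Figure 5 can be sketched serially. Each iteration of the outer loop corresponds to one synchronized GPU step; N is assumed to be a power of two, which holds for the PU sizes used here.

```python
def reduce_sum(values):
    """Pairwise (tree) reduction: add the second half onto the first half,
    halving the active element count each step until one value remains."""
    vals = list(values)
    n = len(vals)
    assert n & (n - 1) == 0  # power-of-two length assumed
    while n > 1:
        half = n // 2
        for i in range(half):   # on the GPU: one thread per index i
            vals[i] += vals[i + half]
        n = half                # on the GPU: a barrier separates the steps
    return vals[0]

# 8 elements need log2(8) = 3 reduction steps instead of 7 serial additions.
```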


5. Experimental results

5.1 OpenCL performance on GPU compared to CPU

OpenCL offers a convenient way to construct heterogeneous computing systems and opportunities to improve parallel application performance. As a first step, the OpenCL SAD kernel was run on two platforms: a CPU with 4 cores at 2.5 GHz and an NVIDIA GeForce 920M GPU at 954 MHz. The SAD block dimensions range from 4 × 8 to 64 × 64 pixels. A comparative analysis between the CPU and the GPU on the same video is shown in Figure 6; it is clear from the figure that the GPU execution time is lower than the CPU execution time (Figure 7) [16].

Figure 6.

Performance OpenCL comparison with GPU and CPU platforms.

Figure 7.

Speed-up using OpenCL language.

Using Eq. (E2), Figure 7 shows the speed-up [17] of the two implementations.

Speed-up = CPU execution time / GPU execution time   (E2)

The speed-up shows that the GPU platform is more efficient than the CPU platform, thanks to the efficient parallel architecture of the GPU compared to the CPU. To validate the OpenCL code against the CUDA code, the following study is proposed.

5.2 OpenCL GPU execution performance compared to CUDA GPU

Running the application on the GPU requires the steps shown in Figure 8. For OpenCL, the approach additionally contains GPU detection and kernel compilation. In all frameworks, the input is copied from the host (CPU) to the device; the kernel is executed on the GPU; the results are copied back from the device to the host and finally displayed on the CPU.

Figure 8.

Algorithm flow.
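The steps of Figure 8 can be mocked up serially as follows. The device-side pieces are plain Python stand-ins written for this sketch, not real CUDA/OpenCL API calls.

```python
# CPU-only mock of the host flow in Figure 8; every function name here is
# illustrative only and does not correspond to a real CUDA/OpenCL API call.

def detect_device():
    return {"name": "mock-gpu"}      # stands in for the platform/device query

def copy_to_device(host_data):
    return list(host_data)           # stands in for a host -> device copy

def launch_sad_kernel(cur, ref):
    # The "kernel": one absolute difference per element, then a reduction.
    return sum(abs(c - r) for c, r in zip(cur, ref))

def copy_to_host(device_result):
    return device_result             # stands in for a device -> host copy

def run(cur, ref):
    detect_device()                       # 1. detect the GPU (for OpenCL,
                                          #    the kernel is also built here)
    d_cur = copy_to_device(cur)           # 2. copy inputs host -> device
    d_ref = copy_to_device(ref)
    result = launch_sad_kernel(d_cur, d_ref)  # 3. execute the kernel
    return copy_to_host(result)           # 4. copy the result back

# run([1, 2, 3], [1, 0, 5]) computes |0| + |2| + |-2| = 4
```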

Table 2 reports the kernel running time for different Prediction Unit (PU) sizes (i.e., the block sizes used). In order to obtain repeatable average times, each problem was run 10 times for both CUDA and OpenCL.

CU size    PU size    CUDA time (μs)    OpenCL time (μs)
8 × 8      4 × 8       6.63              5.592
8 × 8      8 × 4       6.885             5.482
8 × 8      8 × 8       7.573             5.831
16 × 16    8 × 16      8.224             6.013
16 × 16    16 × 8      8.402             5.909
16 × 16    16 × 16     8.992             6.356
32 × 32    16 × 32    10.17              6.451
32 × 32    32 × 16     9.037             6.503
32 × 32    32 × 32     9.729             6.877
64 × 64    32 × 64    10.265             7.964
64 × 64    64 × 32    13.1               7.297
64 × 64    64 × 64    13.687             8.614

Table 2.

GPU kernel execution times (in μs) for the CUDA and OpenCL implementations.

We use a normalized performance metric, called the Performance Ratio (PR), to compare the performance of CUDA and OpenCL (Figure 9).

Figure 9.

Performance ratio.

PR = CUDA execution time / OpenCL execution time   (E3)

If the performance ratio is greater than 1, OpenCL gives better results than CUDA. As shown in Figure 9, the performance ratio indicates that the OpenCL kernel running time is better than the CUDA kernel running time for every block size. Similar results were obtained by Frang et al. [18] and Exterman [19], respectively.
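The Table 2 timings let Eq. (E3) be evaluated directly. The sketch below transcribes the values from the table (in μs) and computes the PR for each PU size.

```python
# (CUDA time, OpenCL time) in microseconds, transcribed from Table 2.
times_us = {
    "4x8":   (6.63,   5.592),
    "8x4":   (6.885,  5.482),
    "8x8":   (7.573,  5.831),
    "8x16":  (8.224,  6.013),
    "16x8":  (8.402,  5.909),
    "16x16": (8.992,  6.356),
    "16x32": (10.17,  6.451),
    "32x16": (9.037,  6.503),
    "32x32": (9.729,  6.877),
    "32x64": (10.265, 7.964),
    "64x32": (13.1,   7.297),
    "64x64": (13.687, 8.614),
}

# PR = CUDA time / OpenCL time (Eq. E3); PR > 1 favours OpenCL.
pr = {pu: cuda / opencl for pu, (cuda, opencl) in times_us.items()}
```

Every PU size yields PR > 1, confirming that OpenCL outperformed CUDA across the board in this experiment.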

5.3 Comparative study

In this section, we compare the time performance of our proposed implementation to state-of-the-art approaches [21, 22].

In the work presented by Xiao et al. [21], experimental results show that, compared with the HEVC reference software, the proposed GPU implementation achieves a 34.4% encoding time reduction on average, while the BD-rate increase is only about 2% for a typical low-delay setting. Another interesting work, by Karimi et al. [22], used a specific real-world application to compare the performance of CUDA with NVIDIA's implementation of OpenCL. Contrary to our results, CUDA's kernel execution was there consistently faster than OpenCL's, despite the two implementations running nearly identical code. CUDA thus seems to be a better choice for applications where achieving the highest possible performance is important; otherwise, the choice between OpenCL and CUDA can be made by considering factors such as prior familiarity with either system or the available development tools for the target GPU hardware. The performance depends on several variables, including code quality, algorithm type, and hardware type.


6. Conclusion

OpenCL is quite competitive with CUDA on the NVIDIA graphics processor in terms of performance. In this work, the use of OpenCL as a portable language for the development of GPGPU applications was studied. Since the SAD accounts for the largest part of the runtime and computation in motion estimation, the reduction technique was used to implement it, which significantly reduces the run time. The performance ratio approached 2 when comparing the OpenCL implementation to the CUDA one.

Parallelizing multiple GPU algorithms could further improve performance. We assume that the suggested concept can also be applied to the ME algorithm of the Joint Collaborative Team on Video Coding (JCT-VC) [20].


Conflict of interest

The authors declare no competing interests.



© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
