Algorithms for Efficient Computation of Convolution

Convolution is an important mathematical tool in both signal and image processing. It is employed in filtering, denoising, edge detection, correlation, compression, deconvolution, simulation, and many other applications. Although the concept of convolution is not new, its efficient computation is still an open topic. As the amount of processed data constantly increases, there is considerable demand for fast processing of huge data sets. Moreover, there is demand for fast algorithms which can exploit the computational power of modern parallel architectures. The aim of this chapter is to review the algorithms and approaches for the computation of convolution with regard to various properties such as signal and kernel size or kernel separability (when processing n-dimensional signals). Target architectures include superscalar and parallel processing units (namely CPU, DSP, and GPU), programmable architectures (e.g. FPGA), and distributed systems (such as grids). The structure of the chapter is designed to cover various applications with respect to the signal size, from small to large scales.


Introduction
Convolution is an important mathematical tool in both signal and image processing. It is employed in filtering [1,2], denoising [3], edge detection [4,5], correlation [6], compression [7,8], deconvolution [9,10], simulation [11,12], and many other applications. Although the concept of convolution is not new, its efficient computation is still an open topic. As the amount of processed data constantly increases, there is considerable demand for fast processing of huge data sets. Moreover, there is demand for fast algorithms which can exploit the computational power of modern parallel architectures.
The basic convolution algorithm evaluates the inner product of a flipped kernel and the neighbourhood of each individual sample of an input signal. Although the time complexity of algorithms based on this approach is only quadratic, i.e. O(N^2) [13,14], the practical implementations are very slow. This is true especially for higher-dimensional tasks, where each new dimension worsens the complexity by increasing the degree of the polynomial, i.e. O(N^(2k)) for k dimensions. Thanks to their simplicity, the naïve algorithms are popular targets for implementation on parallel architectures [15][16][17], yet the use of such implementations is generally limited to small kernel sizes. Under some circumstances, however, the convolution can be computed faster than outlined above.
In the case that the higher-dimensional convolution kernel is separable [18,19], it can be decomposed into several lower-dimensional kernels. In this sense, a 2-D separable kernel can be split into two 1-D kernels, for example. Due to the associativity of convolution, the input signal can be convolved step by step, first with one 1-D kernel, then with the second. The result equals the convolution of the input signal with the original 2-D kernel. Gaussian, Difference of Gaussian, and Sobel are representatives of separable kernels commonly used in signal and image processing. In terms of time complexity, this approach keeps the higher-dimensional convolution a polynomial of lower degree, i.e. O(kN^(k+1)). On the other hand, there is a nontrivial group of algorithms that use general kernels. For example, deconvolution or template matching algorithms based on correlation methods typically use kernels which cannot be characterized by special properties like separability. In this case, other convolution methods have to be used.
There also exist algorithms that can perform the convolution in time O(N). In this concept, the repetitive application of the convolution kernel is reduced by exploiting the fact that neighbouring positions overlap. Hence, the convolution at each individual sample is obtained as a weighted sum of both input samples and previously computed output samples. The design of so-called recursive filters [18] allows them to be implemented efficiently on streaming architectures such as FPGA. Mostly, recursive filters are not designed from scratch; rather, the well-known 1-D filters (Gaussian, Difference of Gaussian, . . . ) are converted into their recursive form. The extension to higher dimensions is straightforward due to their separability. This method also has its drawbacks: the conversion of a general convolution kernel into its recursive version is a nontrivial task, and recursive filtering often suffers from inaccuracy and instability [2].
While the convolution in the time domain performs an inner product at each sample, in the Fourier domain [20] it can be computed as a simple point-wise multiplication. Due to this convolution property and the fast Fourier transform, the convolution can be performed in time O(N log N). This approach is known as fast convolution [1]. The main advantage of this method is that no restrictions are imposed on the kernel. On the other hand, its excessive memory requirements make this approach less popular. Fortunately, there exists a workaround: if a direct computation of the fast convolution of larger signals or images is not realizable on common computers, one can reduce the whole problem to several subtasks. In practice, this leads to splitting the signal and kernel into smaller pieces. The signal and kernel decomposition can be performed in two ways:
• Data can be decomposed in the Fourier domain using the so-called decimation-in-frequency (DIF) algorithm [1,21]. The division of a signal and a kernel into smaller parts also offers a straightforward way of parallelizing the whole task.
• Data can be split in the time domain according to the overlap-save and overlap-add schemes [22,23], respectively. Combining these two schemes with the fast convolution, one obtains a quasi-optimal solution that can be efficiently computed on any computer. Again, the solution naturally leads to a possible parallelization.
The aim of this chapter is to review the algorithms and approaches for the computation of convolution with regard to various properties such as signal and kernel size or kernel separability (when processing k-dimensional signals). Target architectures include superscalar and parallel processing units (namely CPU, DSP, and GPU), programmable architectures (e.g. FPGA), and distributed systems (such as grids). The structure of the chapter is designed to cover various applications with respect to the signal size, from small to large scales.
In the first part, the state-of-the-art algorithms are revised, namely (i) the naïve approach, (ii) convolution with a separable kernel, (iii) recursive filtering, and (iv) convolution in the frequency domain. The second part describes the decomposition of the convolution in both the spatial and the frequency domain and its implementation on a parallel architecture.

Shortcuts and symbols
The following list contains the symbols most commonly used in this chapter. We recommend going through it first to avoid misunderstandings while reading the text.

Naïve approach
First of all, let us recall the basic definition of convolution:

    (f * g)(t) = ∫ f(τ) g(t − τ) dτ    (1)

Since Eq. (1) is used mainly in fields of research other than image and signal processing, we will focus on the alternative definition the reader is likely to be more familiar with, stated for discrete signals:

    h(n) = (f * g)(n) = Σ_m f(m) g(n − m)    (2)

The basic (or naïve) approach visits the individual time samples n of the input signal f. At each position, it computes the inner product of the current sample's neighbourhood and the flipped kernel g, where the size of the neighbourhood is equal to the size of the convolution kernel. The result of this inner product is a number which is simply stored at position n in the output signal h. It is noteworthy that, according to definition (2), the size of the output signal h is always equal to or greater than the size of the input signal f. This fact is related to the boundary conditions. Let f(n) = 0 for all n < 0 ∨ n > N_f and also g(n) = 0 for all n < 0 ∨ n > N_g. Then computing expression (2) at a position n beyond the support of f can still give a non-zero value, i.e. the output signal becomes larger. It can be derived that the size of the output signal h is equal to N_f + N_g − 1.
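The naïve scheme described above can be sketched in a few lines of Python. The helper name `naive_convolve` is hypothetical, and NumPy's `np.convolve` is used only as a reference to check the result:

```python
import numpy as np

def naive_convolve(f, g):
    """Full discrete convolution h(n) = sum_m f(m) g(n - m).

    The output has N_f + N_g - 1 samples; samples outside the
    signals' supports are treated as zero (the boundary condition
    discussed in the text)."""
    Nf, Ng = len(f), len(g)
    h = np.zeros(Nf + Ng - 1)
    for n in range(Nf + Ng - 1):
        # inner product of the flipped kernel with the neighbourhood of n
        for m in range(Nf):
            k = n - m
            if 0 <= k < Ng:
                h[n] += f[m] * g[k]
    return h

f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])
print(naive_convolve(f, g))
print(np.allclose(naive_convolve(f, g), np.convolve(f, g)))  # True
```

The double loop makes the N_f · N_g cost of the method directly visible.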
For the computation of f * g we need to perform N_f · N_g multiplications. The computational complexity of this algorithm is polynomial [13], but we must keep in mind what happens when N_f and N_g become larger, and especially what happens when we extend the computation into higher dimensions. In the 3-D case, for example, expression (2) changes slightly:

    h_3d(x, y, z) = Σ_i Σ_j Σ_k f_3d(i, j, k) g_3d(x − i, y − j, z − k)    (3)

Here, f_3d, g_3d and h_3d have meanings analogous to those in (2). If we denote the sizes of the signal and the kernel along each axis by f_x, f_y, f_z and g_x, g_y, g_z, the cost of the filtering rises from N_f · N_g multiplications in 1-D to f_x f_y f_z · g_x g_y g_z in 3-D, which is unusable for larger signals or kernels. Hence, for higher-dimensional tasks the use of this approach becomes impractical, as each dimension increases the degree of the polynomial. Although the time complexity of this algorithm is polynomial, the solution is advantageous only for kernels with a small support, such as the well-known filters from signal/image processing. For better insight, let us consider the convolution of two relatively small 3-D signals of 1024×1024×100 and 128×128×100 voxels; the example is shown in Fig. 1. When this convolution was performed in double precision on an Intel Xeon QuadCore 2.83 GHz computer, it took about 7 days using the basic approach.

Parallelization.
Due to its simplicity and lack of restrictions, the naïve convolution is still the most popular approach. Its computation is usually sped up by employing large computer clusters that significantly decrease the computation time per node. This approach [15][16][17] assumes the availability of a computer cluster, however.
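The 7-day figure for the 3-D example above can be sanity-checked with a back-of-envelope operation count. The throughput value below is an assumption introduced only to verify the order of magnitude, not a measured number:

```python
# Operation count for the 3-D example in the text:
# a 1024x1024x100 signal convolved with a 128x128x100 kernel.
signal = 1024 * 1024 * 100      # samples in f
kernel = 128 * 128 * 100        # samples in g
mults = signal * kernel         # one multiplication per (sample, kernel tap) pair
print(f"{mults:.2e} multiplications")   # ~1.7e14

throughput = 3e8                # assumed double-precision mult-adds per second
print(f"~{mults / throughput / 86400:.1f} days")  # on the order of the reported 7 days
```

At a few hundred million multiply-adds per second, roughly what an unoptimized scalar loop achieves, 1.7·10^14 multiplications indeed take on the order of a week.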

Convolution on a custom hardware
Dedicated and configurable hardware, namely digital signal processors (DSP) and field-programmable gate array (FPGA) units, is very popular in the field of signal processing for its promising computational power at both low cost and low power consumption. Although the approach based on the Fourier transform is more popular in digital signal processing for its ability to process enormously long signals, the naïve convolution with a small convolution kernel has also been well studied in the literature on various architectures, especially in the context of 2-D and multi-dimensional convolution.
Shoup [24] proposed techniques for the automatic generation of convolution pipelines for small kernels, such as 3×3 pixels. Benedetti et al. [25] proposed a multi-FPGA solution that uses an external memory to store a FIFO buffer and partitions the data among several FPGA units, allowing an increase in the size of the convolution kernel. Perri et al. [26] followed the previous work by designing a fully reconfigurable FPGA-based 2-D convolution processor. The core of this processor contains four 16-bit SIMD 3×3 convolvers, allowing real-time computation of the convolution of an 8-bit or 16-bit image with a 3×3 or 5×5 convolution kernel.
Recently, convolution on custom specialized hardware, e.g. FPGA, ASIC, and DSP, has been used to detect objects [27], edges [28], and other features in various real-time applications.

GPU-based convolution
From the beginning, graphics processing units (GPU) were designed for visualisation purposes. Since the beginning of the 21st century, they have started to play a role in general computations. This phenomenon is often referred to as general-purpose computing on graphics processing units (GPGPU) [29]. At first, there were no high-level programming languages specifically designed for general computation purposes. Programmers instead had to use shading languages such as Cg, High Level Shading Language (HLSL), or OpenGL Shading Language (GLSL) [29][30][31] to utilize the texture units. Recently, two programming frameworks have become widely used in the GPGPU community, namely CUDA [32] and OpenCL [33].
For their ability to efficiently process 2-D and 3-D images and videos, GPUs have been utilized in various image processing applications, including those based on the convolution. Several convolution algorithms, including the naïve one, are included in the CUDA Computing SDK [34]. The naïve convolution on graphics hardware has also been described in [35] and included in the Nvidia Performance Primitives library [36]. Specific applications, namely Canny edge detection [37,38] and real-time object detection [39], have been studied in the literature. It can be noted that the problem of computing a rank filter, such as the median filter, has a naïve solution similar to that of the convolution. Examples can be found in the aforementioned CUDA SDK or in [40,41].
Basically, the convolution is a memory-bound problem [42], i.e. the ratio between arithmetic operations and memory accesses is low. Adjacent threads process adjacent signal samples, including a common neighbourhood. Hence, they should share the data via a faster memory space, e.g. shared memory [35]. To store input data, programmers can also use texture memory, which is read-only but cached. Furthermore, the texture cache exhibits 2-D locality, which makes it naturally suitable especially for 2-D convolutions.

Separable convolution
The naïve algorithm is of polynomial complexity. Furthermore, with each added dimension the polynomial degree rises linearly, which leads to very expensive computation of convolution in higher dimensions. Fortunately, some kernels are so-called separable [18,19].
The convolution with these kernels can simply be decomposed into several lower-dimensional (let us say "cheaper") convolutions. Gaussian and Sobel [4] are representatives of this group of kernels.
A separable convolution kernel must fulfil the condition that its matrix has rank equal to one. In other words, all the rows must be linearly dependent. Why? Let us construct such a kernel. Given a row vector u = (u_1, u_2, u_3, . . ., u_m) and a column vector v = (v_1, v_2, v_3, . . ., v_n)^T, we form the matrix

    A = v u =
    | v_1 u_1   v_1 u_2   ...   v_1 u_m |
    | v_2 u_1   v_2 u_2   ...   v_2 u_m |
    |   ...       ...     ...     ...   |
    | v_n u_1   v_n u_2   ...   v_n u_m |

It is clear that rank(A) = 1. Here, A is a matrix representing a separable convolution kernel, while u and v are the previously mentioned lower-dimensional (cheaper) convolution kernels.
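The rank-1 construction can be demonstrated directly, and the reverse direction, testing a candidate kernel and recovering u and v, can be done with the SVD. The Sobel kernel is used as the stock example; the variable names are for illustration only:

```python
import numpy as np

# Build a separable kernel as the outer product A = v u and verify rank(A) = 1.
u = np.array([1.0, 0.0, -1.0])   # row vector (derivative part)
v = np.array([1.0, 2.0, 1.0])    # column vector (smoothing part)
A = np.outer(v, u)               # the 3x3 Sobel kernel
print(A)
print(np.linalg.matrix_rank(A))  # 1 -> separable

# Conversely, a candidate kernel can be tested and factored via the SVD:
# if only one singular value is (numerically) non-zero, the kernel is
# separable and the leading singular vectors give u and v.
U, s, Vt = np.linalg.svd(A)
u_rec = np.sqrt(s[0]) * Vt[0]
v_rec = np.sqrt(s[0]) * U[:, 0]
print(np.allclose(np.outer(v_rec, u_rec), A))   # True
```

In practice the SVD test should compare the trailing singular values against a small tolerance rather than exact zero.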
In the previous section, we derived the complexity of the naïve approach. We also explained how the complexity worsens when the dimensionality of the processed data increases. If the convolution kernel is separable, we can split the hard problem into a sequence of several simpler ones. Let us recall the 3-D naïve convolution from (3). Assume that g_3d is separable, i.e. g_3d = g_x * g_y * g_z. Then the expression simplifies in the following way:

    h_3d = f_3d * g_3d = ((f_3d * g_x) * g_y) * g_z

The complexity of such an algorithm is then reduced from f_x f_y f_z · g_x g_y g_z multiplications to f_x f_y f_z · (g_x + g_y + g_z). One should keep in mind that the kernel decomposition is usually the only decomposition that can be performed in this task. It relies on the fact that many well-known kernels (Gaussian, Sobel) have special properties. The input signal, in contrast, is typically unpredictable, and in higher-dimensional cases it is unlikely that one could separate it into individual lower-dimensional signals.
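The two-pass evaluation can be verified numerically in 2-D: convolving every row with u and then every column with v must equal the direct convolution with the full kernel. Both helper functions below are illustrative sketches, with NumPy's 1-D `np.convolve` used for the individual passes:

```python
import numpy as np

def conv2d_naive(f, g):
    """Direct full 2-D convolution: one shifted, scaled copy of f per kernel tap."""
    hf, wf = f.shape
    hg, wg = g.shape
    h = np.zeros((hf + hg - 1, wf + wg - 1))
    for i in range(hg):
        for j in range(wg):
            h[i:i + hf, j:j + wf] += g[i, j] * f
    return h

def conv2d_separable(f, u, v):
    """Two cheap passes: rows with u, then columns with v."""
    rows = np.apply_along_axis(np.convolve, 1, f, u)
    return np.apply_along_axis(np.convolve, 0, rows, v)

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 16))
u = np.array([1.0, 0.0, -1.0])
v = np.array([1.0, 2.0, 1.0])
g = np.outer(v, u)   # rank-1, hence separable, kernel

print(np.allclose(conv2d_separable(f, u, v), conv2d_naive(f, g)))  # True
```

For a k×k kernel, the separable path performs roughly 2k multiplications per output sample instead of k², which is the saving derived above.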

Separable convolution on various architectures
As separable filters are very popular in many applications, a number of implementations on various architectures can be found in the literature. Among the most favoured filters, the Gaussian filter is often used for pre-processing, for example in optical flow applications [43,44]. Fialka et al. [45] compared the separable and the fast convolution on graphics hardware and showed both the kernel size and separability to be essential properties that have to be considered when choosing an appropriate implementation. They showed the separable convolution to be more efficient for kernel sizes up to tens of pixels in each dimension, which is usually sufficient when the convolution is used for pre-processing.
The implementation usually does not require particular optimizations, as the separable convolution is intrinsically a sequence of 1-D basic convolutions. Programmers should nevertheless consider some tuning of the memory accesses, as mentioned in Section 2.2. For the case of a GPU implementation, this issue is discussed in [35]. The GPU implementation described in that document is also included in the CUDA SDK [34].

Recursive filtering
The convolution is a process where an inner product, whose size corresponds to the kernel size, is computed again and again at each individual sample. One of the vectors entering this operation (the kernel) is always the same. Clearly, we could compute the whole inner product at only one position, while the neighbouring position can be computed as a slightly modified difference with respect to the first one. The same holds, analogously, for all following positions. The computation of the convolution using this difference-based approach is called recursive filtering [2,18].
The well-known pure averaging filter in 1-D with support 2r + 1 is defined as follows:

    h(n) = (1 / (2r + 1)) Σ_{k=−r}^{r} f(n − k)

The performance of this filter worsens with the width of its support. Fortunately, there exists a recursive version of this filter with constant complexity regardless of the size of its support. Such a filter is no longer defined via the standard convolution but by a recursive formula:

    h(n) = h(n − 1) + ( f(n + r) − f(n − r − 1) ) / (2r + 1)

The transformation of a standard convolution into a recursive filter is not a simple task. There are three main issues that have to be solved:

1. replication - given a slow (but correctly working) non-recursive filter, find its recursive version

2. stability - the recursive formula may cause the computation to diverge

3. accuracy - the recursion may cause the accumulation of small errors
The transformation is a quite complex task, and the so-called Z-transform [22] is typically employed in the process. In principle, each recursive filter could be designed from scratch, like any other filter.
In practice, the standard well-known filters are used as the basis and their recursive counterparts are subsequently derived. There are two principal approaches:
• analytically - the filter is constructed step by step via mathematical formulas [46]
• numerically - the filter is derived using numerical methods [47,48]
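For the averaging filter, the recursive form is simple enough to write down directly. The sketch below compares a direct moving average against its sliding-window recursion, whose per-sample cost is independent of the support r; the function names are illustrative, and zero boundary conditions are assumed as in the text:

```python
import numpy as np

def moving_average_direct(f, r):
    """Plain averaging filter of support 2r+1; cost grows with r."""
    g = np.ones(2 * r + 1) / (2 * r + 1)
    return np.convolve(f, g, mode="same")

def moving_average_recursive(f, r):
    """Recursive version: each output sample is obtained from the previous
    one by a single addition and subtraction, independent of r."""
    n = len(f)
    fp = np.pad(f, r)                    # zero boundary
    h = np.empty(n)
    s = fp[: 2 * r + 1].sum()            # full inner product computed only once
    h[0] = s / (2 * r + 1)
    for i in range(1, n):
        s += fp[i + 2 * r] - fp[i - 1]   # slide the window by one sample
        h[i] = s / (2 * r + 1)
    return h

f = np.random.default_rng(1).standard_normal(100)
print(np.allclose(moving_average_direct(f, 3), moving_average_recursive(f, 3)))  # True
```

The running sum `s` also illustrates the accuracy issue listed above: rounding errors accumulate along the recursion instead of being recomputed at each sample.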

Recursive filters on various architectures
Streaming architectures.
Recursive filtering is a popular approach especially on streaming architectures such as FPGA. The data can be processed in a stream, keeping the memory requirements at a minimum. This allows moving the computation to relatively small and cheap embedded systems. Recursive filters are thus used in various real-time applications such as edge detection [49], video filtering [50], and optical flow [51].

Parallel architectures.
As for parallel architectures, Robelley et al. [52] presented a mathematical formulation for computing time-invariant recursive filters on general SIMD DSP architectures. The authors also discuss the speed-up factor with respect to the level of parallelism and the filter order. Among the GPU implementations, we can mention the work of Trebien and Oliveira, who implemented recursive filters in CUDA for the purpose of realistic sound synthesis and processing [53]. In this case, the recursive filters were computed in the frequency domain.

Fast convolution
In the previous sections, we introduced the common approaches to computing the convolution in the time (spatial) domain. We mentioned that in some applications one has to cope with signals of millions of samples, where the computation of the convolution requires too much time. Hence, for long or multi-dimensional input signals, the popular approach is to compute the convolution in the frequency domain, which is sometimes referred to as the fast convolution. As shown in [45], the fast convolution can be even more efficient than the separable version if the number of kernel samples is large enough. Although the concepts of the fast Fourier transform [54] and the frequency-based convolution [55] are several decades old, new problems arise with each upcoming architecture. For example, efficient access to the memory was an important issue in the 1970s [56] just as it is today [21,23]. Another problem to be considered is the numerical precision [57].
In the following text, we will first recall the Fourier transform along with some of its important properties, and the convolution theorem, which provides us with a powerful tool for the convolution computation. Subsequently, we will describe the algorithm of the so-called fast Fourier transform, often simply denoted as FFT, and mention some notable implementations of the FFT. Finally, we will summarize the benefits and drawbacks of the fast convolution.

Fourier transform
The Fourier transform F = F[f] of a function f and the inverse Fourier transform f = F^(−1)[F] are defined as follows:

    F(ω) = ∫ f(t) e^(−jωt) dt,        f(t) = (1 / 2π) ∫ F(ω) e^(jωt) dω

The discrete finite equivalents of the aforementioned transforms are defined as follows:

    F(k) = Σ_{n=0}^{N−1} f(n) e^(−j(2π/N)kn),        f(n) = (1/N) Σ_{k=0}^{N−1} F(k) e^(j(2π/N)kn)

where k, n = 0, 1, . . ., N − 1. The so-called normalization factors 1/2π and 1/N, respectively, guarantee that the identity f = F^(−1)[F[f]] is maintained. The exponential e^(−j(2π/N)) is called the base function; for the sake of simplicity, we will refer to it as W_N. If the sequence f(n), n = 0, 1, . . ., N − 1, is real, the discrete Fourier transform F(k) has some specific properties, in particular the conjugate symmetry

    F(N − k) = F*(k)

This means that in the output signal F, only half of the samples are useful; the rest is redundant. As real signals are typical in many practical applications, most popular FT and FFT implementations provide users with special functions that handle real signals in order to save time and memory.
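The discrete transform and the conjugate-symmetry property can be checked with a direct O(N²) evaluation against NumPy's FFT. The `dft` helper is an illustrative name:

```python
import numpy as np

def dft(f):
    """Direct O(N^2) evaluation of F(k) = sum_n f(n) W_N^(kn),
    with the base function W_N = exp(-j 2 pi / N)."""
    N = len(f)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)
    return W @ f

f = np.random.default_rng(2).standard_normal(8)   # real input
F = dft(f)
print(np.allclose(F, np.fft.fft(f)))              # True

# Conjugate symmetry of a real signal's spectrum: F(N - k) = conj(F(k)),
# so only half of the output samples carry information.
k = np.arange(1, len(f))
print(np.allclose(F[len(f) - k], np.conj(F[k])))  # True
```

This redundancy is exactly what the real-input variants (`rfft`-style functions) of FFT libraries exploit to halve time and memory.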

Convolution theorem
According to the convolution theorem, the Fourier transform of the convolution of two signals f and g is equal to the product of their Fourier transforms F[f] and F[g] [58]:

    F[f * g] = F[f] · F[g]    (14)

In the following text, we will sometimes refer to the convolution computed by applying Eq. (14) as the "classical" fast convolution algorithm.
In the discrete case, the same holds for periodic signals (sequences) and is sometimes referred to as circular or cyclic convolution [22]. However, in practical applications one usually deals with non-periodic finite signals. This results in the so-called windowing problem [59], causing undesirable artefacts in the output signals; see Fig. 2. In practice, the problem is usually solved by imposing periodicity on the kernel, applying a so-called windowing function, or padding the kernel with zero values. One also has to ensure that the sizes of the input signal and the convolution kernel are equal. Generally, this is also solved by padding both the signal and the kernel with zero values. The size of both padded signals entering the convolution is hence N = N_f + N_g − 1, where N_f and N_g are the numbers of signal and kernel samples, respectively. The equivalent property holds in the multi-dimensional case. The most time-demanding operation of the fast convolution approach is the Fourier transform, which can be computed by the fast Fourier transform algorithm. The time complexity of the fast convolution is hence equal to the complexity of the FFT, that is O(N log N). A detailed discussion of the complexity is provided in Section 6.
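The padding rule N = N_f + N_g − 1 can be exercised directly: with both signals zero-padded to this length, the circular convolution computed through the DFT coincides with the linear one. The helper name `fast_convolve` is illustrative; the real-input transforms are used since both signals are real:

```python
import numpy as np

def fast_convolve(f, g):
    """Linear convolution via the convolution theorem.  Both signals are
    zero-padded to N = N_f + N_g - 1 so that the circular convolution
    computed by the DFT equals the desired linear one (no wrap-around)."""
    N = len(f) + len(g) - 1
    F = np.fft.rfft(f, N)            # rfft pads to length N automatically
    G = np.fft.rfft(g, N)
    return np.fft.irfft(F * G, N)    # point-wise product, then inverse transform

f = np.random.default_rng(3).standard_normal(1000)
g = np.random.default_rng(4).standard_normal(200)
print(np.allclose(fast_convolve(f, g), np.convolve(f, g)))  # True
```

Omitting the padding (passing `N = len(f)`) reproduces the wrap-around artefacts of the windowing problem described above.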

Fast Fourier transform
In 1965, Cooley and Tukey [60] proposed an algorithm for the fast computation of the Fourier transform. This widely-known algorithm was improved over the years and optimized for various signal lengths, but the basic idea remained the same. The problem is handled in a divide-and-conquer manner by splitting the input signal into N parts¹ and processing the individual parts recursively. Without loss of generality, we will recall the idea of the FFT for N = 2, which is the simplest situation. There are two fundamental approaches to splitting the signal, called decimation in time (DIT) and decimation in frequency (DIF) [58].

Decimation in time (DIT).
Assuming that N is even, the radix-2 decimation-in-time algorithm splits the input signal f(n), n = 0, 1, . . ., N − 1 into the parts f_e(n′) and f_o(n′), n′ = 0, 1, . . ., N/2 − 1 of even and odd samples, respectively. By recursive application of the approach, the discrete Fourier transforms F_e and F_o of the two parts are computed. Finally, the resulting Fourier transform F can be computed as follows:

    F(k) = F_e(k mod N/2) + W_N^k F_o(k mod N/2)

where k = 0, 1, . . ., N − 1. The signals F_e and F_o are of half length; however, they are periodic, hence F_e(k′ + N/2) = F_e(k′) and F_o(k′ + N/2) = F_o(k′) for any k′ = 0, 1, . . ., N/2 − 1. The algorithm is shown in Fig. 3(a).

¹ The individual variants of the algorithm for a particular N are called radix-N algorithms.
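A minimal recursive radix-2 DIT implementation makes the even/odd split and the twiddle-factor combination explicit. The name `fft_dit` is illustrative, and the input length is assumed to be a power of two:

```python
import numpy as np

def fft_dit(f):
    """Recursive radix-2 decimation-in-time FFT (length must be a power of 2).
    Even and odd samples are transformed separately and combined with the
    twiddle factors W_N^k, as in the butterfly diagram of Fig. 3(a)."""
    f = np.asarray(f, dtype=complex)
    N = len(f)
    if N == 1:
        return f
    Fe = fft_dit(f[0::2])    # DFT of the even samples
    Fo = fft_dit(f[1::2])    # DFT of the odd samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    # periodicity of Fe, Fo and W_N^(k+N/2) = -W_N^k give the two halves
    return np.concatenate([Fe + W * Fo, Fe - W * Fo])

f = np.random.default_rng(5).standard_normal(64)
print(np.allclose(fft_dit(f), np.fft.fft(f)))   # True
```

The two concatenated halves are the butterfly: the same products W_N^k F_o(k) serve both F(k) and F(k + N/2), which is the source of the N log N operation count.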

Decimation in frequency (DIF).
Having a signal f of even length N, the sequences f_r and f_s of half length are created as follows:

    f_r(n′) = f(n′) + f(n′ + N/2)
    f_s(n′) = ( f(n′) − f(n′ + N/2) ) W_N^{n′}    (17)

Then the Fourier transforms F_r and F_s fulfil the following property: F_r(k′) = F(2k′) and F_s(k′) = F(2k′ + 1) for any k′ = 0, 1, . . ., N/2 − 1. Hence, the sequences f_r and f_s are processed recursively, as shown in Fig. 3(b). It is easy to deduce the inverse of Eq. (17):

    f(n′) = ( f_r(n′) + f_s(n′) W_N^{−n′} ) / 2
    f(n′ + N/2) = ( f_r(n′) − f_s(n′) W_N^{−n′} ) / 2
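One DIF step can be verified numerically: the half-length DFTs of f_r and f_s must reproduce the even- and odd-indexed samples of the spectrum of f, respectively:

```python
import numpy as np

# One decimation-in-frequency step: the half-length sequences f_r and f_s
# are built from the two halves of f, and their DFTs give the even- and
# odd-indexed samples of F.
f = np.random.default_rng(6).standard_normal(16)
N = len(f)
W = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors W_N^(n')

f_r = f[: N // 2] + f[N // 2 :]
f_s = (f[: N // 2] - f[N // 2 :]) * W

F = np.fft.fft(f)
print(np.allclose(np.fft.fft(f_r), F[0::2]))   # F_r(k') = F(2k')
print(np.allclose(np.fft.fft(f_s), F[1::2]))   # F_s(k') = F(2k' + 1)
```

Applying the step recursively to f_r and f_s yields the full radix-2 DIF FFT; note that, unlike DIT, the splitting happens on the output (frequency) indices.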

The most popular FFT implementations
On CPU.
One of the most popular FFT implementations ever is the so-called Fastest Fourier Transform in the West (FFTW) [61]. It is kept updated and is available for download at http://www.fftw.org/. According to the authors' comprehensive benchmark [62], it is still one of the fastest CPU implementations available. The top performance is achieved by using multiple CPU threads, the extended instruction sets of modern processors such as SSE/SSE2, optimized radix-N algorithms for N up to 7, optimized functions for purely real input data, etc. Other popular CPU implementations can be found, e.g., in the Intel libraries called Intel Integrated Performance Primitives (IPP) [63] and Intel Math Kernel Library (MKL) [64]. In terms of performance, they are comparable with the FFTW.
On other architectures.
Among the GPU implementations, probably the most widely used is the CUFFT library by Nvidia. Although it is dedicated to Nvidia graphics cards, it is popular due to its good performance and ease of use. It also contains optimized functions for real input data. The FFT has also been implemented on various other architectures, including DSP [68] and FPGA [69].

Benefits and drawbacks of the fast convolution
To summarize this section, the fast convolution is the most efficient approach if both the signal and the kernel contain thousands of samples or more, or if the kernel is somewhat smaller but non-separable. Thanks to numerous implementations, it is accessible to a wide range of users on various architectures. The main drawbacks are the windowing problem, the relatively lower numerical precision, and considerable memory requirements due to the signal padding. In the following, we will examine the memory usage in detail and propose several approaches to optimizing it on modern parallel architectures.

Decomposition in the time domain
In this section, we will focus on the decomposition of the fast convolution in the time domain.
We will provide an analysis of the time and space complexity. Regarding the former, we will focus on the number of additions and multiplications needed by the studied algorithms.
Utilizing the convolution theorem and the fast Fourier transform, the 1-D convolution of a signal f and a kernel g requires

    O( (N_f + N_g) log(N_f + N_g) )

steps [8]. Here, the term (N_f + N_g) reflects the fact that the processed signal f was zero-padded to prevent the overlap effect caused by the circular convolution. The kernel was modified in the same way. Another advantage of using the Fourier transform stems from its separability.
Convolving two 3-D signals f_3d and g_3d, where the padded size along each axis is (f_x + g_x), (f_y + g_y), and (f_z + g_z), hence requires

    O( N log N ),    N = (f_x + g_x)(f_y + g_y)(f_z + g_z)

steps in total.
Up to now, this method seems to be optimal. Before we proceed, let us look at the space complexity of this approach. If we do not take into account the buffers for the input/output signals and serialize both Fourier transforms, we need space for two equally aligned Fourier signals and some negligible Fourier transform workspace. In total, this is

    2C(N_f + N_g)

bytes, where C is a constant depending on the numerical precision. Keeping in mind that, due to the lack of available memory, a direct computation of the fast convolution is not realizable on common computers, we will try to split the whole task into several subtasks. This means that the input signal and the kernel will be split into smaller pieces, so-called tiles, which need not be of the same size. Hence, we will try to reduce the memory requirements while keeping the efficiency of the whole convolution process, as proposed in [23].

Signal tiling
Splitting the input signal f into smaller disjoint tiles f_1, f_2, . . ., f_m, then performing m smaller convolutions f_i * g, i = 1, 2, . . ., m, and finally concatenating the results together while discarding the overlaps is a well-known algorithm in digital signal processing. The implementation is commonly known as the overlap-save method [22].

Method.
Without loss of generality, we will focus on the manipulation of just one tile f_i. The other tiles are processed in the same way. The tile f_i is uniquely determined by its size and shift with respect to the origin of f. Its size and shift also uniquely determine the area in the output signal h where the expected result of f_i * g is going to be stored. In order to guarantee that the convolution f_i * g correctly computes the appropriate part of the output signal h, the tile f_i must be equipped with some overlap into its neighbours. The size of this overlap is equal to the size of the whole kernel g. Hence, the tile f_i is extended equally on both sides and we get f′_i. If the tile f_i is a boundary one, it is padded with zero values. As the fast convolution requires both the signal and the kernel to be of the same size, the kernel g must also be extended; it is simply padded with zeros, which produces g′. As soon as f′_i and g′ are prepared, the convolution f′_i * g′ can be performed and the result is cropped to the size || f_i ||. Then, all the convolutions f′_i * g′, i = 1, 2, . . ., m are successively performed and the output signal h is obtained by concatenating the individual results together. A general form of the method is shown in Fig. 4(a).
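The tiling scheme can be sketched in 1-D. The sketch below follows the classical overlap-save convention of extending each tile by N_g − 1 preceding samples (zeros at the signal boundary) and discarding that overlap after the tile convolution; the helper name and the tile size are illustrative:

```python
import numpy as np

def overlap_save(f, g, tile=256):
    """Overlap-save sketch: f is cut into disjoint tiles, each tile is
    extended by N_g - 1 preceding samples (zeros at the boundary), the
    extended tile is convolved with g, and the overlap is discarded.
    Returns the first N_f samples of f * g."""
    Nf, Ng = len(f), len(g)
    h = np.empty(Nf)
    for start in range(0, Nf, tile):
        T = min(tile, Nf - start)
        lo = start - (Ng - 1)
        # extended tile f'_i, padded with zeros where it leaves the signal
        fi = np.zeros(T + Ng - 1)
        src_lo = max(lo, 0)
        fi[src_lo - lo : src_lo - lo + (start + T - src_lo)] = f[src_lo : start + T]
        ci = np.convolve(fi, g)          # each tile convolution may use the FFT
        h[start : start + T] = ci[Ng - 1 : Ng - 1 + T]   # crop the overlap
    return h

f = np.random.default_rng(7).standard_normal(1000)
g = np.random.default_rng(8).standard_normal(31)
print(np.allclose(overlap_save(f, g), np.convolve(f, g)[: len(f)]))  # True
```

Each tile convolution here uses `np.convolve` for clarity; in the method described in the text it would be a fast convolution on a padded tile of size ||f_i|| + ||g||.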

Analysis of time complexity.
Let us first inspect the memory requirements of this approach. As the filtered signal f is split into m pieces, the respective memory requirements are lowered to

    2C(N_f/m + N_g)

bytes. Concerning the time complexity, after splitting the signal f into m tiles we need to perform

    O( m (N_f/m + N_g) log(N_f/m + N_g) )

multiplications in total. If there is no division (m = 1), we get the time complexity of the fast approach. If the division is total (m = N_f), we get an even worse complexity than that of the basic convolution. The higher the level of splitting required, the worse the complexity. Therefore, we can conclude that splitting only the input signal into tiles does not help.

Kernel tiling
From the previous text, we can see that splitting only the input signal f might be inefficient. It may even happen that the kernel g is so large that splitting only the signal f does not reduce the memory requirements sufficiently. As convolution is a commutative operator, one could recommend swapping the input signal and the kernel. This may help, namely when the signal f is small and the kernel g is very large. As soon as the signal and the kernel are swapped, we can simply apply the overlap-save method. However, this approach fails when both the signal and the kernel are too large. Let us therefore decompose the kernel g as well.

Method.
Keeping in mind that the input signal f has already been decomposed into m tiles using the overlap-save method, we can focus on the manipulation of just one tile f_i, i = 1, 2, . . ., m.
For the computation of the convolution of the selected tile f_i with the large kernel g, we will employ the so-called overlap-add method [22]. This method splits the kernel g into n disjoint (non-overlapping) pieces g_j, j = 1, 2, . . ., n. Then it performs n cheaper convolutions f_i * g_j, and finally adds the results together, preserving the appropriate overruns.
Without loss of generality, we will focus on the manipulation of just one kernel tile g_j. Prior to the computation, the selected tile g_j has to be aligned to the size || f_i || + ||g_j ||. This is done simply by padding g_j with zeros equally on both sides. In this way, we get the tile g′_j. The signal tile f_i is also aligned to the size || f_i || + ||g_j ||. However, f′_i is not padded with zeros; it is created from f_i by extending its support equally on both sides.
Each kernel tile g j has its positive shift s j with respect to the origin of g.This shift is very important for further computation and cannot be omitted.Before we perform the convolution f ′ i * g ′ j we must shift the tile f ′ i within f by s j samples to the left.The reason originates from the idea of kernel decomposition and minus sign in Eq. ( 2) which causes the whole kernel to be flipped.As soon as the convolution f ′ i * g ′ j is performed, its result is cropped to the size || f i || and added to the output signal h into the position defined by overlap-save method.Finally, all the convolutions f ′ i * g ′ j , j = 1, 2, . . .n are performed to get complete result for one given tile f i .A general form of the method is shown in Fig. 4(b).
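The role of the shift s_j can be made explicit with a small numpy sketch. Here the shift/padding bookkeeping is expressed equivalently through linearity: each partial result f * g_j is simply added into the output at the tile's offset s_j. The function name and the uniform tile boundaries are illustrative choices of ours.

```python
import numpy as np

def conv_kernel_tiling(f, g, n):
    """Convolution with the kernel g decomposed into n disjoint tiles g_j.
    Each partial result f * g_j is accumulated into the output at the
    tile's shift s_j (overlap-add over the kernel)."""
    h = np.zeros(len(f) + len(g) - 1)
    bounds = np.linspace(0, len(g), n + 1, dtype=int)  # tile boundaries in g
    for j in range(n):
        s_j = bounds[j]                      # shift of tile j w.r.t. origin of g
        g_j = g[s_j:bounds[j + 1]]
        if len(g_j) == 0:
            continue
        part = np.convolve(f, g_j)           # cheaper convolution with a small tile
        h[s_j:s_j + len(part)] += part       # overlap-add with the proper shift
    return h
```

Omitting s_j would stack all partial results at the origin and produce a wrong result, which is why the text stresses that the shift cannot be omitted.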
The complete computation of the convolution across all signal and kernel tiles is sketched in Algorithm 1.
Algorithm 1. Divide-and-conquer approach applied to the convolution over large data.
( f , g) ← (input signal, kernel)
f → f_1, f_2, . . ., f_m {split 'f' into tiles according to the overlap-save scheme}
g → g_1, g_2, . . ., g_n {split 'g' into tiles according to the overlap-add scheme}
h ← 0 {create the output signal 'h' and fill it with zeros}
for i = 1 to m do
  for j = 1 to n do
    h_ij ← f′_i * g′_j {convolve the aligned tiles}
    {add h_ij to h following the overlap-add output rules}
  end for
end for
Output ← h

6.2.0.8. Analysis of time complexity.
Let us suppose the signal f is split into m tiles and the kernel g is decomposed into n tiles. The time complexity of the fast convolution f_i * g_j is (N_f/m + N_g/n) log2(N_f/m + N_g/n). We have m signal tiles and n kernel tiles. In order to perform the complete convolution f * g, we have to perform m × n convolutions (see the nested loops in Algorithm 1) of the individual signal and kernel tiles. In total, we have to complete mn(N_f/m + N_g/n) log2(N_f/m + N_g/n) multiplications. If there is no division (m = 1, n = 1), we obtain the fast convolution. If the division is total (m = N_f, n = N_g), we obtain the basic convolution, i.e. the complexity class O(N_f N_g). Concerning the space occupied by our convolution algorithm, we need C(N_f/m + N_g/n) bytes, where C is again the precision-dependent constant and m, n are the levels of division of the signal f and the kernel g, respectively.
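The nested loops of Algorithm 1 can be sketched compactly in numpy. For clarity this sketch performs the m × n per-tile convolutions directly with np.convolve rather than via the FFT, and uses uniform tile boundaries; the function name and parameters are illustrative, not the chapter's.

```python
import numpy as np

def conv_tiled(f, g, m, n):
    """Convolution with the signal f split into m tiles and the kernel g
    split into n tiles; the m*n partial results are accumulated into h
    at the offsets given by the tile positions."""
    h = np.zeros(len(f) + len(g) - 1)
    fb = np.linspace(0, len(f), m + 1, dtype=int)   # signal tile boundaries
    gb = np.linspace(0, len(g), n + 1, dtype=int)   # kernel tile boundaries
    for i in range(m):
        f_i = f[fb[i]:fb[i + 1]]
        for j in range(n):
            g_j = g[gb[j]:gb[j + 1]]
            if len(f_i) == 0 or len(g_j) == 0:
                continue
            part = np.convolve(f_i, g_j)            # one of the m*n cheap convolutions
            off = fb[i] + gb[j]                     # combined overlap-save/add offset
            h[off:off + len(part)] += part
    return h
```

Only one pair (f_i, g_j) needs to be held at a time, which is the source of the C(N_f/m + N_g/n) space bound derived above.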
We have now designed an algorithm splitting the signal f into m tiles and the kernel g into n tiles. Now we answer the question regarding the optimal way of splitting the input signal and the kernel. As the relationship between m and n is hard to express and N_f and N_g are constants, let us define the following substitution: x = N_f/m and y = N_g/n. Here, x and y stand for the sizes of the signal and the kernel tiles, respectively. Applying this substitution to Eq. (25) and simplifying, we get the function Φ(x, y) = (N_f/x)(N_g/y)(x + y) log2(x + y). The plot of this function is depicted in Figure 5. The minimum of this function is reached if and only if x = y and both variables x and y are maximized, i.e. the input signal and the kernel tiles should be of the same size (equal number of samples) and they should be as large as possible. In order to reach the optimal solution, the size of the tile should be a power of small primes [70]. In this sense, it is recommended to fulfill both criteria put on the tile size: the maximality (as stated above) and the capability of simple decomposition into small primes.
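A quick numerical check of these two claims, assuming the substituted complexity Φ(x, y) = (N_f/x)(N_g/y)(x + y) log2(x + y) as written above:

```python
import math

def phi(x, y, Nf, Ng):
    """Multiplication count of the tiled fast convolution after the
    substitution x = Nf/m, y = Ng/n (sizes of signal and kernel tiles)."""
    return (Nf / x) * (Ng / y) * (x + y) * math.log2(x + y)
```

For a fixed memory budget x·y, equal tiles beat skewed ones, and larger tiles beat smaller ones, in agreement with the minimum visible in Figure 5.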

Extension to higher dimensions
All the previous statements are related only to a 1-D signal. Provided both the signal and the kernel are 3-dimensional and the tiling process is identical in all the axes, we can combine Eq. (20) and Eq. (25) to obtain the corresponding 3-D complexity. This statement can be further generalized to higher dimensions or to an irregular tiling process. The proof can be derived simply from the separability of the multidimensional Fourier transform, which guarantees that the time complexity of the higher-dimensional Fourier transform depends only on the amount of processed samples. There is no difference in the time complexity whether the higher-dimensional signal is elongated or in the shape of a cube.
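The extension to n dimensions can be sketched with the multidimensional FFT, whose cost depends only on the total number of samples. This is a generic numpy sketch of the n-D fast convolution, not the chapter's specific implementation; the function name is ours.

```python
import numpy as np

def fast_convolve_nd(f, g):
    """Full n-D linear convolution via the multidimensional real FFT.
    The FFT cost depends only on the total number of padded samples,
    regardless of whether the volume is elongated or cube-shaped."""
    shape = [fs + gs - 1 for fs, gs in zip(f.shape, g.shape)]  # padded sizes
    H = np.fft.rfftn(f, shape) * np.fft.rfftn(g, shape)        # pointwise product
    return np.fft.irfftn(H, shape)
```

The same tiling analysis then applies per axis, since the separable transform factors into 1-D transforms along each dimension.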
As the majority of recent computers are equipped with multi-core CPUs, the following text is devoted to the parallelization of our approach on this architecture. Each such computer contains two or more cores; however, the cores share one memory. This means that executing two or more huge convolutions concurrently may simply fail due to a lack of available memory. A possible workaround is to perform one more division, i.e. the signal and kernel tiles are further split into even smaller pieces. Let p be the number of sub-pieces the signal and the kernel tiles are split into, and let P be the number of available processors. If we execute the individual convolutions in parallel, we obtain the corresponding overall number of multiplications and space requirements. Let us study the relationship of p versus P:
• p < P . . . The space complexity becomes worse than in the original non-parallelized version (26). Hence, there is no advantage in using this approach.
• p > P . . . There are no additional memory requirements. However, the signal and the kernel are split into too small pieces. We have to handle a large number of tile overlaps, which causes the time complexity (29) to become worse than in the non-parallelized case (25).
• p = P . . . The space complexity is the same as in the original approach. The time complexity is slightly better, but practically it brings no advantage due to the large number of memory accesses. The efficiency of this approach would become evident only if P ≫ 1.
As standard multi-core processors are typically equipped with only 2, 4, or 8 cores, this approach was not found to be very useful either.
Regarding computer clusters, the problem with one shared memory is solved as each computer has its private memory. Therefore, the total number of multiplications (see Eq. (25)) is modified by the factor B/P, where P is the number of available computers and B is a constant representing the overheads and the cost of data transmission among the individual computers. The computation becomes effective only if P > B. The memory requirements for each node remain the same as in the non-parallelized case, as each computer takes care of its own private memory space.

Decomposition in the frequency domain
Just like the decomposition in the spatial (time) domain, the decomposition in the frequency domain can be used for the fast convolution algorithm in order to (i) decrease the required amount of memory available per processing unit, and (ii) employ multiple processing units without the need for extensive data transfers between the processors. In the following text, we introduce the concept of the decomposition [21] along with optimization steps suitable for purely real data [71]. Subsequently, we present the results achieved on current graphics hardware. Finally, we summarize the applications and architectures where the approach can be used.

Decomposition using the DIF algorithm
In Section 5.3, the decimation-in-frequency algorithm was recalled. The DIF can be used not only to compute the FFT itself but also to decompose the fast convolution. This algorithm can be divided into several phases, namely (i) the so-called decomposition into parts using Eq. (17), (ii) the Fourier transforms of the parts, (iii) the convolution by pointwise multiplication itself, (iv) the inverse Fourier transforms, and (v) the so-called composition using Eq. (18). In the following paragraph, we provide the mathematical background for the individual phases. The scheme of the algorithm is shown in Fig. 6(a). By employing Eq. (17), both the input signal f and the kernel g can be divided into sub-parts f_r, f_s and g_r, g_s, respectively. As the Fourier transforms F_r and F_s satisfy F_r(k′) = F(2k′) and F_s(k′) = F(2k′ + 1), and the equivalent property holds for G_r and G_s, by applying the FFT on f_r, f_s, g_r, and g_s individually, we obtain two separate parts of both the signal and the kernel. Subsequently, by computing the point-wise multiplications H_r = F_r G_r and H_s = F_s G_s, respectively, we obtain two separate parts of the Fourier transform of the convolution h = f * g. Finally, the result h is obtained by applying Eq. (18) to the inverse Fourier transforms h_r and h_s.
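The five phases can be sketched end-to-end in numpy. The decompose step below is one DIF butterfly stage (our reading of Eq. (17), with twiddle factors W_N^{n′} = e^{−2πjn′/N}), and the final lines implement the composition (our reading of Eq. (18)); the function names are illustrative.

```python
import numpy as np

def dif_convolution(f, g):
    """Circular convolution of f and g (even length N) computed through the
    DIF decomposition: each signal is split into two half-length parts that
    are transformed, multiplied and inverse-transformed independently."""
    N = len(f)
    half = N // 2
    W = np.exp(-2j * np.pi * np.arange(half) / N)   # twiddle factors W_N^{n'}

    def decompose(x):   # phase (i): one DIF stage
        a, b = x[:half], x[half:]
        return a + b, (a - b) * W

    f_r, f_s = decompose(f.astype(complex))
    g_r, g_s = decompose(g.astype(complex))

    # phases (ii)-(iv): two independent half-size fast convolutions;
    # FFT(f_r)[k'] = F(2k') and FFT(f_s)[k'] = F(2k'+1)
    h_r = np.fft.ifft(np.fft.fft(f_r) * np.fft.fft(g_r))
    h_s = np.fft.ifft(np.fft.fft(f_s) * np.fft.fft(g_s))

    # phase (v): composition — undo the twiddle and recombine the halves
    t = h_s * np.conj(W)                             # multiply by W_N^{-n'}
    return np.concatenate([(h_r + t) / 2, (h_r - t) / 2]).real
```

The two half-size convolutions never touch each other's data, which is precisely what allows them to run on separate processing units.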
In the first and the last phase, it is inevitable to store the whole input signals in the memory. Here, the memory requirements are equal to those of the classical fast convolution algorithm. However, in the phases (ii)-(iv), which are by far the most computationally extensive, the memory requirements are inversely proportional to the number of parts d the signals are divided into. The algorithm is hence suitable for architectures with a star topology, where the central node is relatively slow but has large memory, and the end nodes are fast but have small memory. A powerful desktop PC with one or several GPU cards is a typical example of such an architecture. It can be noted that the decimation-in-time (DIT) algorithm can also be used for the purpose of decomposing the convolution problem. However, its properties make it sub-efficient for practical use. Firstly, its time complexity is comparable with that of the DIF. Secondly, and most importantly, it requires significantly more data transfers between the central and end nodes. In Section 7.5, the complexity of the individual algorithms is analysed in detail.

Optimization for purely real signals
In most practical applications, users work with purely real input signals. As described in Section 5.1, the Fourier transform is complex but satisfies specific properties when applied to such data. Therefore, it is reasonable to optimize the fast convolution algorithm in order to reduce both the time and the memory complexity. In the following paragraphs, we describe three fundamental approaches to optimizing the fast convolution of real signals.
Real-to-complex FFT.
As described in Section 5.4, most popular FFT implementations offer specialized functions for the FFT of purely real input data. With the classical fast convolution, users are advised to use such functions of their preferred FFT library. With the DIF decomposition, it is nevertheless no longer possible to use them, as the decomposed signals are no longer real.
It is recommendable to overlap the data transfers with some computation phases in order to keep the implementation as efficient as possible.
To prove the importance of the overlapping, we provide a detailed analysis of the algorithm workflow. The overall computation time T required by the algorithm can be expressed as follows: T = t_p + t_d + t_a + t_h→d + t_conv/P + t_d→h + t_c, where t_p is the time required for the initial signal padding, t_d for the decomposition, t_a for allocating memory and setting up FFT plans on the GPU, t_h→d for the data transfers from CPU to GPU (host to device), t_conv for the convolution including the FFT, the recombination phase, the point-wise multiplication, and the inverse FFT, t_d→h for the data transfers from GPU to CPU (device to host), and finally t_c for the composition. The number of end nodes (GPU cards) is denoted by P. It is evident that, in accordance with the famous Amdahl's law [72], the speed-up achieved by multiple end nodes is limited by the only parallel phase of the algorithm, which is the convolution itself. Now, if the data are decomposed into d parts and sent to P end units, and if d > P > 1, the data transfers can be overlapped with the convolution phase. This means that the real computation time is shorter than T in Eq. (33). Eq. (33) can hence be viewed as an upper limit. A model example is shown in Fig. 8.
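The cost model above can be written out as a one-line function; note that the exact form of Eq. (33) is our reconstruction, assuming that only the convolution phase t_conv scales with P while all other phases remain serial.

```python
def total_time(t_p, t_d, t_a, t_hd, t_conv, t_dh, t_c, P):
    """Upper bound on the overall computation time: only the convolution
    phase t_conv is parallel across the P end nodes; the padding,
    (de)composition, allocation, and transfer phases are serial."""
    return t_p + t_d + t_a + t_hd + t_conv / P + t_dh + t_c
```

Plugging in sample timings shows the Amdahl-style floor: no matter how large P grows, T never drops below the sum of the serial phases.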

Algorithm comparison
In the previous text, we mentioned three approaches to the decomposition of the fast convolution: tiling (decomposition in the time domain), the DIF-based, and the DIT-based algorithm. For a fair comparison of the three, we compute the number of arithmetic operations, the number of data transfers, and the memory requirements per end node, with respect to the input signal length and the parameter d, i.e. the number of parts the data are divided into. As for the tiling method, the computation is based on Eq. (27), setting d = m = n (the optimum case). The results are shown in Table 1.

Method | # of operations | # of data transfers | Memory required per node
Table 1. Methods for the decomposition of the fast convolution and their requirements.

To conclude the results, it can be noted that the tiling method is the best one in terms of memory demands: it requires 4× less memory per end node than the DIF-based and the DIT-based algorithms. On the other hand, both its number of operations and its number of data transfers depend on the parameter d, which is not the case for the DIF-based method. By dividing the data into more sub-parts, the memory requirements of the DIF-based algorithm decrease while its number of operations and memory transactions remain constant. Hence, the DIF-based algorithm can generally be more efficient than the tiling.

Applications and architectures
Both the tiling and the DIF-based algorithm can be used to allow the computation of the fast convolution in applications where the convolved signals are multi-dimensional and/or contain too many samples to be handled efficiently on a single computer. We already mentioned the application to optical microscopy data, where the convolution is used to simulate the image degradation introduced by an optical system. Using the decomposition methods, the computation can be distributed over (a) a computer grid, or (b) multiple CPU and GPU units, where the CPU is usually provided with more memory and is hence used as a central node for the (de)composition of the data.

Conclusions
In this text, we introduce the convolution as an important tool in both signal and image processing. In the first part, we mention some of the most popular applications it is employed in and recall its mathematical definition. Subsequently, we present a number of common algorithms for the efficient computation of the convolution on various architectures. The simplest approach, the so-called naïve convolution, is to perform the convolution directly from the definition. Although it is less efficient than other algorithms, it is the most general one and is popular in some specific applications where small convolution kernels are used, such as edge or object detection. If the convolution kernel is multi-dimensional and can be expressed as a convolution of several 1-D kernels, then the naïve convolution is usually replaced by its alternative, the so-called separable convolution. The lowest time complexity can be achieved by using recursive filtering. Here, the result of the convolution at each position can be obtained by applying a few arithmetic operations to the previous result. Besides the efficiency, the advantage is that these filters are suitable for streaming architectures such as FPGA. On the other hand, this method is generally not suitable for all convolution kernels, as the recursive filters are often numerically unstable and inaccurate. The last algorithm presented in the chapter is the fast convolution. According to the so-called convolution theorem, the convolution can be computed in the frequency domain by a simple point-wise multiplication of the Fourier transforms of the input signals. This approach is the most suitable for long signals and kernels as it generally yields the best time complexity. However, it has non-trivial memory demands caused by the fact that the input data need to be padded.
Therefore, in the second part of the chapter, we describe two approaches to reduce the memory requirements of the fast convolution. The first one, so-called tiling, is performed in the spatial (time) domain. It is the most efficient with respect to the memory requirements. However, the higher the number of sub-parts the input data are divided into, the more the number of arithmetic operations and the number of potential data transfers increase. Hence, in some applications or on some architectures (such as a desktop PC with one or multiple graphics cards) where the overhead of data transfers is critical, one can use a different approach, based on the decimation-in-frequency (DIF) algorithm, which is widely known from the concept of the fast Fourier transform. We also mention a third method, based on the decimation-in-time (DIT) algorithm. However, the DIT-based algorithm is sub-efficient from every point of view, so there is no reason to use it instead of the DIF-based one. At the end of the chapter, we also provide a detailed analysis of (i) the number of arithmetic operations, (ii) the number of data transfers, and (iii) the memory requirements for each of the three methods.
As the convolution is one of the most extensively studied operations in signal processing, the list of the algorithms and implementations mentioned in this chapter is not, and cannot be, complete. Nevertheless, we tried to include those that we consider the most popular and widely used. We also believe that the decomposition tricks described in the second part of the chapter, which are the subject of the authors' original research, can help readers improve their own applications, regardless of the target architecture.
• … th sample of i-th Fourier transform base function and inverse Fourier transform base function, respectively
• z* . . . complex conjugate of complex number z
• * . . . symbol for convolution
• e . . . Euler number (e ≈ 2.718)
• j . . . complex unit (j² = −1)
• f, g . . . input signal and convolution kernel, respectively
• h . . . convolved signal
• F, G . . . Fourier transforms of input signal f and convolution kernel g, respectively
• N_f, N_g . . . length of input signal and convolution kernel, respectively (number of samples)
• n, k . . . index of a signal in the spatial and the frequency domain, respectively
• n′, k′ . . . index of a signal of half length in the spatial and the frequency domain, respectively
• P . . . number of processing units in use
• Φ . . . computational complexity function
• ||s|| . . . number of samples of a discrete signal (sequence) s

Figure 1 .
Figure 1.Example of a 3-D convolution.The images show an artificial (phantom) image of a tissue, a PSF of an optical microscope, and blurred image, computed by the convolution of the two images.Each 3-D image is represented by three 2-D views (XY, YZ, and XZ).

Figure 2 .
Figure 2. Example of the so-called windowing effect produced by signal f (a) and kernel g (b).The circular convolution causes border effects as seen in (c).The properly computed basic convolution is shown in (d).

Figure 3 .
Figure 3.The basic two radix-2 FFT algorithms: decimation-in-time and decimation-in-frequency. Demonstration on an input signal of 8 samples.
The fast convolution requires C(N_f + N_g) bytes (Eq. (21)), where (N_f + N_g) is the size of one padded signal and C is a constant dependent on the required algorithm precision (single, double, or long double). If double precision is required, for example, then C = 2 · sizeof(double), which corresponds to the two Fourier signals used by the real-valued FFT. In the 3-D case, the memory needed by the aligned signal is proportionally higher.

Figure 4 .
Figure 4. Using the overlap-save and overlap-add methods, the input data can be segmented into smaller blocks and convolved separately.Finally, the sub-parts are concatenated (a) or summed (b) together.

Figure 5 .
Figure 5. A graph of the function Φ(x, y) that represents the time complexity of the tiled convolution. The x-axis and y-axis correspond to the number of samples in the signal and the kernel tile, respectively. The minimum of the function Φ(x, y) occurs at the location where both variables (the sizes of the tiles) are maximized and equal at the same time.

Figure 6 .
Figure 6. A scheme of the convolution algorithm with the decomposition in the frequency domain [71]. An input signal is decomposed into 2 parts by the decimation-in-frequency (DIF) algorithm. The parts are subsequently processed independently using the discrete Fourier transform (DFT).

Figure 7 .
Figure 7.A scheme description of the proposed algorithm for the convolution with the decomposition in the frequency domain, implemented on GPU [21].The example shows the decomposition into 4 parts.

Figure 8 .
Figure 8. A model timeline of the algorithm workflow [21]. The dark boxes denote data transfers between CPU and GPU, while the light boxes represent convolution computations. The first row shows the single-GPU implementation. The second row depicts the parallel usage of two GPUs; the data transfers are performed concurrently but through a common bus, therefore they take twice as long. In the third row, the data transfers are synchronized so that only one transfer is made at a time. In the last row, the data transfers are overlapped with the convolution execution.