Using the discrete wavelet transform (DWT) in high-speed signal-processing applications demands careful attention to hardware resource availability, latency, and power consumption. In this chapter, the design aspects and performance of multiplierless DWT implementations are analyzed. We present the two key multiplierless approaches, namely the distributed arithmetic algorithm (DAA) and the residue number system (RNS). We aim to estimate the performance requirements and hardware resources for each approach, allowing for the selection of a proper algorithm and the implementation of multi-level DAA- and RNS-based DWT. The design has been implemented and synthesized on a Xilinx Virtex-6 ML605 board, taking advantage of the Virtex-6's embedded block RAMs (BRAMs).
- discrete wavelet transform (DWT)
- distributed arithmetic algorithm (DAA)
- field programmable gate array (FPGA)
- residue number system (RNS)
- multiplierless implementation
The architecture of the embedded platform plays a significant role in ensuring that real-time systems meet their performance requirements. Moreover, software development suffers from increased implementation complexity and a lack of standard methodology for partitioning signal-processing functionality across heterogeneous hardware platforms. For instance, a digital signal processor (DSP) is cheaper, consumes less power, and makes software development easier, but it has considerable latency and lower throughput compared with field programmable gate arrays (FPGAs). For high-speed signal-processing (HSP) communication systems, such as cognitive radio (CR) [2, 3] and software-defined radio (SDR), a DSP may fail to capture and process the received data, leading to data loss. In addition, implementing applications such as finite impulse response (FIR) filtering, discrete wavelet transform (DWT), or fast Fourier transform (FFT) in software limits the throughput, which is not sufficient to meet the requirements of high-bandwidth, high-performance applications. As a result, HSP systems are enhanced by off-loading complex signal-processing operations to hardware platforms.
Although FPGAs exhibit increased development time and design complexity, they are preferred for meeting high-performance requirements for two reasons. First, they efficiently address signal-processing tasks that can be pipelined. Second, they allow a programmable circuit architecture to be developed with flexibility in computational, memory, speed, and power requirements. However, an FPGA has its own resources, such as memory, configurable logic blocks (CLBs), and multipliers, that influence the performance and the selected algorithm. As a consequence, the choice of algorithm is determined by hardware resource availability and performance requirements. These factors interact with each other and create many trade-offs that need to be balanced.
As an example, the discrete wavelet transform (DWT) [5, 6, 7, 8, 9], a linear signal-processing technique that transforms a signal from the time domain to the wavelet domain, employs various techniques for decomposing a signal into orthonormal time series with different frequency bands. The signal decomposition is performed using a pyramid algorithm (PA) [10, 11] or a recursive pyramid transform (RPT). While the PA is based on convolutions with quadrature mirror filters, which is less feasible for hardware implementation, the RPT decomposes the signal x[n] into two parts using high- and low-pass filters, which can be implemented as FIR filters. Figure 1 shows a four-tap FIR filter with four multipliers, known as a multiplier-accumulator (MAC) structure. In the MAC structure, the multipliers multiply the input samples with the filter coefficients, and the products are accumulated to form each output sample.
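The MAC-style FIR computation described above can be sketched in a few lines. This is a minimal illustration, not the chapter's hardware design; the coefficients and inputs are hypothetical.

```python
# Hypothetical 4-tap FIR filter in the MAC (multiplier-accumulator) style:
# each output sample is the sum of coefficient-input products, mirroring the
# four-multiplier structure of Figure 1.
def fir_mac(x, h):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n-k] (zeros assumed before x[0])."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]   # one multiply-accumulate per tap
        y.append(acc)
    return y

if __name__ == "__main__":
    h = [1, 2, 3, 4]        # illustrative integer coefficients, not a wavelet
    x = [1, 0, 0, 0, 0]     # impulse input
    print(fir_mac(x, h))    # the impulse response reproduces the taps
```

With an impulse input, the output simply reproduces the tap values, which is a quick sanity check for any FIR structure.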
This work focuses exclusively on implementing a one-level multiplierless DWT for a pattern-based cognitive communication system receiver (PBCCS) by means of an FPGA. The DWT is required to extract the received signal's features. The extracted features are then fed into a multilayer perceptron (MLP) neural network (NN) to identify the received symbol. The most challenging part is that the NN could consume most of the available multipliers inside the FPGA. As an example, Ntoune et al. implemented real-valued time-delay neural network (RVTDNN) and real-valued recurrent neural network (RVRNN) architectures with 600 and 720 multipliers, respectively, while the ML605, ZC706, and VC707 boards have 768, 900, and 2800 multipliers (DSP48Es), respectively.
Although modern FPGAs come with a reasonable number of multipliers, designers prefer to implement a multiplier-free DWT architecture for several reasons. First, some of the multipliers can be reserved for tasks such as pulse-shaping filters and digital up- and down-converters used at SDR front-ends. Second, in contrast to the DWT, the MLP weights depend on the learning step. Third, the MLP weights could change frequently at runtime in an adaptive manner, whereas the DWT coefficients are fixed and known. Therefore, a multiplier-free DWT architecture simplifies the design process and allows the designers to focus on the MLP design.
In this work, we present a 1-D DWT implementation on FPGA by means of memory-based approaches. The aim is to compare different implementations in terms of system performance and resource consumption. We demonstrate the implementation of Daubechies wavelets (DB2, DB4, and DB5) using the DAA and RNS approaches, neither of which employs explicit multipliers. Because the main focus of this work is extracting the key features of a signal via the DWT, the inverse DWT (IDWT) and the high-pass filter coefficients are not considered.
1.1. Related work
Implementations of the 1-D DWT for signal de-noising, feature extraction, pattern recognition, and compression can be found in [8, 9, 18, 19]. The conventional convolution-based DWT requires massive computation and consumes much area and power, which can be mitigated by the lifting-based scheme introduced by Sweldens. Although the lifting scheme computes the low- and high-pass outputs using fewer components, it may not be well suited to our application owing to the PBCCS's nature, where the low-frequency components are much more important than the higher ones. Therefore, in this study, a 1-D DWT decomposition implemented by means of filter banks is considered. Another advantage of the convolution-based DWT over the lifting approach is that it does not require temporary registers to store intermediate results, and with an appropriate design strategy, it can achieve better area and power efficiency.
Rather than the simplest implementation of an FIR filter via multipliers and an adder tree, a multiplier-free architecture is used because it results in a low-complexity system with high-throughput-processing capability. Fundamentally, there are two techniques for facilitating such parallel processing: the distributed arithmetic algorithm (DAA) and the residue number system (RNS). The DAA computes the inner product bit-serially with the aid of a lookup table (LUT) followed by shift-accumulate operations [23, 24]. Several techniques have been proposed to improve the design, such as the partial-sum technique, a multiple-memory-bank technique [26, 27], and an LUT-less adder-based technique. The DAA approach has been adopted in many applications, such as the least-mean-square (LMS) adaptive filter and the square-root-raised-cosine filter.
On the other hand, the RNS is an integer number system in which operations are performed on the residues of division operations [31, 32, 33]. Eventually, the RNS-based results are converted back to the equivalent binary number format using the Chinese remainder theorem (CRT). The key advantage of the RNS is that it reduces a wide arithmetic operation to a set of concurrent but simple operations. Several applications, such as digital filters, benefit from RNS implementations, for example, [35, 36, 37]. In addition, the RNS has been combined with the DAA in one architecture, called RNS-DA [38, 39], which benefits from the advantages of both approaches.
In this chapter, three major 1-D DWT approaches are implemented on FPGA-based platforms and compared in terms of performance and energy requirements. The implementations are compared for different numbers of multipliers, memory consumption, filter taps (N), and transform levels (L) to show their advantages. To the best of our knowledge, no detailed comparison of hardware implementations of the three major 1-D DWT designs exists in the literature. This comparison gives significant insight into which implementation is the most suitable for given values of the relevant algorithmic parameters. Although many efficient designs exist in the literature, we did not optimize the number of memories in any approach, so that the comparison remains fair.
The remainder of this chapter is organized as follows. Section 2 presents the preliminary information needed to understand the DWT and reviews the theoretical background of the DAA and RNS. Section 3 describes the implementation of the discrete wavelet transform and provides an analytical comparison between the approaches. Section 4 presents the performance results. Finally, Section 5 concludes the chapter.
2. Fundamentals and basic concepts
2.1. Discrete wavelet transform
The wavelet decomposition mainly depends on orthonormal filter banks. Figure 2 shows a two-channel wavelet structure for decomposition, where the input signal is split by a low-pass and a high-pass filter and each branch is down-sampled by two.
Mathematically, a signal x[n] is decomposed into approximation (low-frequency) and detail (high-frequency) components at each level.
The decimated low-pass-filtered output is recursively passed through identical filter banks to add a dimension of varying resolution at every stage. Eqs. (2) and (3) represent the filtering process through a digital low-pass filter h[n] and a high-pass filter g[n], each followed by down-sampling by two.
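One decomposition level of the filter-bank scheme above can be sketched as filter-then-decimate on both branches. This is a minimal illustration assuming simple Haar-like integer coefficients, not the DB2/DB4/DB5 coefficients used in the chapter.

```python
# Sketch of one decomposition level as in Figure 2: convolve with the
# low-pass filter h and the high-pass filter g, then keep every second sample.
def convolve(x, h):
    y = [0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k, hk in enumerate(h):
            if 0 <= n - k < len(x):
                y[n] += hk * x[n - k]
    return y

def dwt_level(x, h, g):
    lo = convolve(x, h)[1::2]   # approximation: low-pass, down-sample by 2
    hi = convolve(x, g)[1::2]   # detail: high-pass, down-sample by 2
    return lo, hi

if __name__ == "__main__":
    # Haar-like integer filters, purely for illustration
    print(dwt_level([1, 2, 3, 4], [1, 1], [1, -1]))
```

Each branch produces half as many samples as the input, which is the property exploited by the multi-level cascade later in the chapter.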
2.2. Distributed arithmetic algorithm
The distributed arithmetic algorithm (DAA) eliminates multipliers by performing the arithmetic as bit-serial computations. Because the down-sampling process follows each filter (as shown in Figure 2), Eq. (2) can be rewritten without the decimation factor as a plain inner product, y[n] = h[0]x[n] + h[1]x[n − 1] + … + h[N − 1]x[n − N + 1] (Eq. (4)).
Obviously, Eq. (4) requires intensive computation due to the multiplication of the real input values with the filter coefficients. Eq. (4) can be simplified by representing each input sample in a bit-serial two's-complement form, which distributes the arithmetic over the input bits and replaces the multiplications with table lookups and shift-accumulate operations.
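The bit-level trick can be demonstrated in software: precompute a LUT of all partial sums of the coefficients, then accumulate one LUT value per input bit plane. This is a hedged sketch with assumed integer coefficients and unsigned inputs (the hardware design additionally handles the two's-complement sign bit).

```python
# Distributed-arithmetic sketch: LUT[a] holds the sum of coefficients h[k]
# for every tap k whose bit is set in the address a. The inner product is
# then accumulated bit plane by bit plane with shifts only - no multipliers.
def build_lut(h):
    return [sum(hk for k, hk in enumerate(h) if (a >> k) & 1)
            for a in range(1 << len(h))]

def daa_inner_product(xs, h, width=8):
    """Inner product of unsigned `width`-bit samples xs with coefficients h."""
    lut = build_lut(h)
    acc = 0
    for b in range(width):                     # one pass per input bit
        addr = 0
        for k in range(len(h)):
            addr |= ((xs[k] >> b) & 1) << k    # gather bit b of each input
        acc += lut[addr] << b                  # shift-accumulate the LUT word
    return acc
```

Because the LUT address is formed from one bit of each of the N inputs, the table has 2^N entries, which is exactly the exponential memory growth discussed in Section 3.4.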
2.3. Residue number system
The RNS is a non-weighted number system that performs parallel carry-free addition and multiplication. In DSP applications that require intensive computation, the carry-free propagation allows for concurrent computation in each residue channel. The RNS moduli set is a set of pairwise relatively prime integers; an integer is represented by its residues with respect to each modulus.
Mapping from the RNS back to integers is performed by the Chinese remainder theorem (CRT) [34, 41, 42]. The CRT states that the binary/decimal representation of a number can be uniquely recovered from its RNS representation if all elements of the moduli set are pairwise relatively prime.
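A software analogue of the forward and reverse converters makes the CRT mapping concrete. This is a minimal sketch (the moduli set below is an assumed example); the hardware RBC in Section 3.2.5 implements the same mapping with modulo adders.

```python
# Forward (binary-to-residue) and reverse (residue-to-binary) conversion
# for a pairwise-coprime moduli set, using the Chinese remainder theorem.
def to_rns(x, moduli):
    return [x % m for m in moduli]

def from_rns(residues, moduli):
    M = 1
    for m in moduli:
        M *= m                       # dynamic range of the moduli set
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the multiplicative inverse of Mi modulo m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

if __name__ == "__main__":
    moduli = [127, 128, 129]         # example set of the form 2**n-1, 2**n, 2**n+1
    print(from_rns(to_rns(123456, moduli), moduli))
```

Any integer below the product of the moduli round-trips exactly, which is why the dynamic range condition in Section 3.2.5 matters.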
Designing a robust RNS-based DWT requires selecting a moduli set and implementing the hardware design of the residue-to-binary conversion. The most widely studied moduli sets are built around powers of two due to the attractive arithmetic properties of such moduli, and several of these sets have been investigated in the literature.
For the purpose of illustration, the moduli set {2^n − 1, 2^n, 2^n + 1} is used for several reasons. First, the modulo adder (MA) is simple and nearly identical for the 2^n − 1 and 2^n + 1 channels. Second, for small n, the residue channels are narrow, which keeps the modular arithmetic fast and the lookup memories small.
3. DWT implementation methodology
3.1. DWT implementation using DA
The DAA hides the explicit multiplications behind a ROM lookup table. The memory stores all possible values of the inner product of the fixed filter-coefficient vector with every combination of input bits.
Figure 3 shows the block diagram of the 1-bit DAA at a given bit position.
3.2. DWT implementation using RNS
The RNS-based DWT implementation has mainly three components. They are the forward converter, the modulo adders (MAs), and the reverse converter. The forward converter, which is also known as the binary-to-residue converter (BRC), is used to convert a binary input number to residue numbers. By contrast, the reverse converter or the residue-to-binary converter (RBC) is used to obtain the result in a binary format from the residue numbers. We refer to the RNS system, which does not include RBC, as a forward converter and modular-adders (FCMA), as illustrated in Figure 4.
3.2.1. The forward converter
The forward converter is used to convert the result of multiplying an input number by a wavelet coefficient into its residue representation with respect to each modulus in the set.
3.2.2. RNS-system number conversion
The received samples and wavelet coefficients span the real numbers and may take small values. One of the main drawbacks of the RNS number representation is that it operates only on positive integers in the range [0, M − 1], where M is the product of the moduli.
In addition, the received sample is likewise scaled to an integer before it is converted to its residue representation.
Table 1 lists the DB2 low-pass coefficients as real values and as their RNS-system equivalents after multiplication by 2^11.

|Coefficient||Real value||RNS-system value|
The multiplication of the received sample by each scaled coefficient is then performed independently in each residue channel.
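The scale-multiply-recover flow can be sketched end to end. This is an illustration under stated assumptions: the n = 10 moduli set is chosen for range, the scale factor 2^11 follows the chapter, and the double scaling of the product is undone after conversion back to binary.

```python
# Illustrative scaled multiplication in the residue domain: both the real
# sample and the real coefficient are scaled by 2**11 and rounded, multiplied
# channel-wise, recombined via the CRT, and the 2**22 scale is divided out.
SCALE = 1 << 11
MODULI = [2**10 - 1, 2**10, 2**10 + 1]   # assumed n = 10 set for dynamic range

def scaled_rns_product(a, b):
    ai, bi = round(a * SCALE), round(b * SCALE)
    res = [((ai % m) * (bi % m)) % m for m in MODULI]  # channel-wise multiply
    M = MODULI[0] * MODULI[1] * MODULI[2]
    x = 0
    for r, m in zip(res, MODULI):                      # CRT recombination
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return (x % M) / SCALE**2                          # undo the double scaling
```

The recovered product is accurate to within the rounding error of the 2^11 quantization, which is the trade-off examined in the precision analysis of Section 4.3.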
We propose a few improvements to this design. First, instead of reserving a dedicated memory for each modulus, the lookup values for all moduli are packed into a single memory word, so the residues of one partial product are read out together.
As with the DAA-based approach, if the input word length is large, the input is divided into smaller segments, and each segment addresses its own lookup memory.
According to the previous improvements, the RNS-based design works as follows. The input is divided into four segments. Each segment addresses a memory that stores the precomputed residues of that segment multiplied by the filter coefficient, already weighted by the segment's bit position.
To obtain the final multiplication result, the partial residues of the segments are combined channel-wise using modulo adders.
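The segment-lookup idea above can be modeled directly. This is a sketch under assumptions: a 4-bit segment width and a 16-bit input are illustrative choices, and each table entry holds the coefficient times the segment value, pre-weighted by the segment's bit offset, reduced modulo m.

```python
# Segmented constant multiplication modulo m: each input segment addresses a
# table holding (coeff * segment * 2**offset) mod m, and the table outputs are
# combined with modulo additions only - no multiplier is instantiated.
SEG = 4  # assumed bits per segment

def make_tables(coeff, m, width=16):
    tables = []
    for s in range(width // SEG):
        tables.append([((coeff * d) << (SEG * s)) % m for d in range(1 << SEG)])
    return tables

def mod_mul_lut(x, tables, m):
    acc = 0
    for s, tab in enumerate(tables):
        seg = (x >> (SEG * s)) & ((1 << SEG) - 1)   # extract segment s
        acc = (acc + tab[seg]) % m                  # modulo adder in hardware
    return acc
```

The memory cost per tap is fixed by the segment width, not by the filter order, which is the key difference from the DAA memory scaling discussed in Section 3.4.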
3.2.4. Modulo adder (MA)
The modulo adders are required for adding the results from the modular multiplier as well as inside the reverse converter. In this work, we have two MA types—that is, one based on modulo 2^n − 1 addition and one based on modulo 2^n + 1 addition.
3.2.5. The reverse converter
The Chinese remainder theorem (CRT) provides the theoretical basis for converting a residue number into a natural integer. The chosen moduli set can be efficiently implemented by four modulo adders and two multiplexers. The output of the RBC is unsigned.
As a consequence, the dynamic range of the moduli set must be greater than the maximum possible output magnitude; otherwise the result wraps around and is decoded incorrectly.
3.3. Two-level DWT implementation
The two-level discrete wavelet transform comprises two one-level DWTs, where the output of the first level is fed into the second level (as shown in Figure 7). The one-level DWT at each level is identical, but the output length of each level is halved. For example, if a signal of 1800 samples is applied to the input, then 900 and 450 samples are produced by the first and second levels, respectively.
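The halving across the cascade can be sketched as follows. This is a minimal illustration with an assumed Haar-like integer low-pass filter; only the approximation branch is cascaded, as in the chapter's design.

```python
# Two-level cascade as in Figure 7: the level-1 approximation output feeds an
# identical level-2 filter bank, so each level halves the sample count
# (1800 -> 900 -> 450 in the chapter's example).
def lowpass_decimate(x, h=(1, 1)):
    y = []
    for n in range(1, len(x), 2):        # compute only the kept (even-phase) samples
        y.append(sum(hk * x[n - k] for k, hk in enumerate(h) if n - k >= 0))
    return y

x = list(range(1800))
a1 = lowpass_decimate(x)     # first level: 900 samples
a2 = lowpass_decimate(a1)    # second level: 450 samples
print(len(a1), len(a2))
```

Computing only the retained output phase (rather than filtering and then discarding) is the same observation that lets Eq. (2) be rewritten without the decimation factor in Section 2.2.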
Figure 7 shows the design of the two-level RNS-based DWT, which involves two FCMA blocks and two RBC blocks. Between the levels, the result of the first stage must be converted to binary, shifted right by 11 bits to undo the coefficient scaling, and converted to residue numbers again.
3.4. Hardware complexity
3.4.1. Memory usage
The DAA and RNS techniques employ memory as a key resource to avoid multiplying two variables. In each approach, as the number of filter taps increases, the size and the number of memories change differently. In the following, the received word length is assumed to be 22 bits.
The total number of memory elements occupied by the DAA-based filter is 22 × (2^N × 22)—that is, twenty-two lookup tables (one per input bit), each with 2^N entries of 22 bits—so the table depth grows exponentially with the number of filter taps N.
On the other hand, the total number of memory elements occupied by an RNS-based filter is 4N × (2^4 × 22), that is, four segment memories per tap. The number of memories increases linearly with the number of filter taps, while the size of each memory remains constant (Table 2).
|Number of filter taps||4||8||10|
|DA memory usage||22 × (2^4 × 22)||22 × (2^8 × 22)||22 × (2^10 × 22)|
|RNS memory usage||16 × (2^4 × 22)||32 × (2^4 × 22)||40 × (2^4 × 22)|
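The two scaling laws can be tabulated with a few lines of arithmetic. This sketch assumes the counting model just described: one 2^N-entry table per input bit for the DAA (word length 22), and four fixed-size segment tables per tap for the RNS.

```python
# Memory scaling comparison (assumed model): DAA keeps a fixed number of
# tables whose depth doubles with every extra tap, while RNS adds a fixed
# number of constant-size tables per tap.
W = 22  # assumed word length in bits
rows = []
for taps in (4, 8, 10):
    daa_tables = W             # DAA table count is fixed by the word length
    daa_depth = 2 ** taps      # entries per DAA table grow exponentially
    rns_tables = 4 * taps      # RNS table count grows linearly with taps
    rows.append((taps, daa_tables, daa_depth, rns_tables))
for r in rows:
    print(r)
```

The linear RNS counts (16, 32, 40) match the RAMB18E1 figures reported for DB2, DB4, and DB5 in Section 4, while the DAA block count stays at 22 as the depth grows.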
3.4.2. Shift register and adder counts
The DAA-based implementation employs shift registers and adders to sum the result at each bit level (Figure 3). For a 22-bit word length, the partial results of the 22 bit positions are combined through shift-accumulate stages.
On the other hand, for the same word length, the RNS-based implementation sums the segment and tap contributions with modulo adder (MA) blocks in each residue channel.
For instance, a three-channel DB2 implementation requires nine MA blocks to sum the result, plus the modulo adders inside the reverse converter.
Meanwhile, the number of RNS-based binary adders depends on the internal design of the MA block. Each MA block for a modulus of the form 2^n ± 1 is itself built from binary adders and correction logic, so the total adder count scales with both the number of channels and the number of taps.
4. Performance analysis and validation
Hardware analysis was carried out using the Xilinx System Generator for DSP (SysGen), a high-level tool that enables the MATLAB/Simulink model-based environment to be used for creating and verifying hardware designs for Xilinx FPGAs. Furthermore, the hardware-software co-simulation design was synthesized and implemented on the ML605 Xilinx Virtex-6 board.
The RNS and DA implementations are compared with the multiplier-accumulator-based DWT structure (MAC), as shown in Figure 8. We also consider a direct DWT implementation using the IP FIR Compiler 6.3 (FIR6.3) block, which provides a common interface to generate highly area-efficient and high-performance FIR filters (Figure 9).
For the RNS implementation, the moduli sets {2^n − 1, 2^n, 2^n + 1} with n = 7 and n = 10 were evaluated.
4.1. Resource utilization and system performance
Table 4 summarizes the resource use of the RNS-based components—that is, the FCMA and the reverse converter (RBC). The RBC consumes fewer resources and less power. The operating frequencies of the individual components are comparable and higher than that of the entire RNS-based filter.
|Resources||FCMA (n = 7)||RBC (n = 7)||FCMA (n = 10)||RBC (n = 10)|
|Number of slice registers||656||157||883||190|
|Number of slice LUTs||591||138||854||180|
|Number of RAMB18E1||16||0||16||0|
|Max. operating freq. (MHz)||291.2||311.62||283.85||298.67|
|Min. period (ns)||3.434||3.21||3.523||3.348|
|Estimated total power (mW)||40.5||6.59||43.08||7.33|
|Latency (clock cycle (CC))||6||6||6||6|
Table 5 summarizes the resource consumption of each filter implementation. It shows that the MAC and IP FIR-based implementations each use four multiplier units (DSP48E1s), with maximum frequencies of 296.38 and 472.59 MHz, respectively.
|Resources||MAC||DA||FIR||RNS (n = 7)||RNS (n = 10)|
|Number of slice registers||282||661||167||767||1089|
|Number of slice LUTs||128||520||71||721||1055|
|Number of occupied slices||58||188||60||240||358|
|Number of DSP48E1s||4||0||4||0||0|
|Number of RAMB18E1||0||22||0||16||16|
|Max. operating freq. (MHz)||296.38||229.83||472.59||258.86||261.028|
|Min. period (ns)||3.374||4.351||2.030||3.863||3.831|
|Estimated total power (mW)||8.44||66.54||9.05||56.22||53.05|
Table 6 compares the DA- and RNS-based one-level DWT implementations when using larger filter banks—that is, DB4 and DB5. It shows that the DAA-based implementation occupies a fixed number of RAMB18E1 blocks; the number of memory elements in the DAA-based implementation depends only on the word length (Table 2).
|Resources||DA (DB2)||DA (DB4)||DA (DB5)||RNS, n = 7 (DB2)||RNS, n = 7 (DB4)||RNS, n = 7 (DB5)|
|Number of slice registers||650||737||780||767||1441||1898|
|Number of slice LUTs||521||539||568||721||132||1677|
|Number of RAMB18E1||22||22||22||16||32||40|
|Max. operating freq. (MHz)||232.72||205.55||223.31||258.87||265.32||258.80|
However, as the number of filter taps N increases, the depth of each DAA memory increases exponentially to 2^N entries.
4.2. Functionality verification
The discrete wavelet transform was simulated by means of the ModelSim simulator. Figure 10 shows that the MAC and DAA implementations have lower latency than the other approaches, while the FIR- and RNS-based designs produce their outputs several clock cycles later.
4.3. Precision analysis
We carried out the precision analysis for the first and second DWT levels, and the results are presented in Table 7. The output bit precision is set to Q5.16 for all implementations. Table 7 shows the maximum performance based on the signal-to-noise ratio (SNR) and the peak signal-to-noise ratio (PSNR). For the n = 7 moduli set, we could not achieve better accuracy with the specified scaling factors because its dynamic range is too small to accommodate the scaled intermediate results.
|Resources||MAC||DA||FIR||RNS (n = 7)||RNS (n = 10)|
|Internal word length||22 bit||22 bit||NA||22 bit||31 bit|
Hardware availability and system performance requirements are critical for selecting the appropriate architecture of the embedded platform. The number of DWT levels, the number of filter taps, and the word length all have a substantial influence on the performance and complexity of the design.
Increasing the number of DWT levels has roughly the same effect on the operating frequency in all approaches. Because the only change between the RNS-based designs with n = 7 and n = 10 is the channel word length, their operating frequencies are close to each other.
Another critical parameter that affects the DWT performance is the filter order. The DAA-based implementation outperforms the RNS-based one in memory-block usage for the filter orders considered here (up to 10 taps), although each of its memories is deeper.
In addition, the two approaches have different memory contents. Whereas the memory content of the DAA-based implementation is identical across its tables, the memory content of the RNS-based design varies from tap to tap. This is expected, because each memory stores the multiplication values of one filter coefficient with respect to the moduli set.
The word length determines the number of occupied memories in both implementations. As the word length increases, the number of memories in the DAA- and RNS-based approaches increases linearly.
In this chapter, we examined multiplierless DWT implementations, whose design choices have a substantial impact on the overall performance of the design and on resource availability. We presented DAA- and RNS-based implementations of the DWT and compared them with the MAC-based approach. The former two are multiplierless architectures that use memory intensively to speed up the overall processing.
Given the implementation examples for experimental verification and analysis, the approaches were simulated using Simulink and validated on a Xilinx Virtex-6 FPGA platform. The co-simulation results were also verified against the simulation environment. The complexity analysis of the multi-level DWT with respect to hardware structure provides a foundation for choosing an appropriate algorithm for high-performance applications, such as cognitive communication systems that combine DWT analysis with machine-learning algorithms.