Implementation results of pipeline and P = 4 parallel DWPT (italic) and IDWPT (bold) architectures.
This work targets the challenging issue to produce high throughput and low-cost configurable architecture of Discrete wavelet transforms (DWT). More specifically, it proposes a new hardware architecture of the first and second generation of DWT using a modified multi-resolution tree. This approach is based on serializations and interleaving of data between different stages. The designed architecture is massively parallelized and sharing hardware between low-pass and high-pass filters in the wavelet transformation algorithm. Consequently, to process data in high speed and decrease hardware usage. The different steps of the post/pre-synthesis configurable algorithm are detailed in this paper. A modulization in VHDL at RTL level and implementation of the designed architecture on FPGA technology in a NexysVideo board (Artix 7 FPGA) are done in this work, where the performance, the configurability and the generic of our architecture are highly enhanced. The implementation results indicate that our proposed architectures provide a very high-speed data processing with low needed resources. As an example, with the parameters depth order equal 2, filter order equal 2, order quantization equal 5 and a parallel degree P = 16, we reach a bit rate around 3160 Mega samples per second with low used of logic elements (≈400) and logic registers (≈700).
- Mallat binary tree algorithm
- lifting scheme wavelet
- FIR filter
- parallel-pipeline architecture
- VHDL-RTL modeling
We notice in the last year a wide usage of wavelet transform theory in different domain like telecommunications, image and video processing, data compression, optical fiber, encryption and others. But these domains are evolved extremely which require a new wavelet transform architecture with low cost target technology that can provide a high-speed data processing and low power consummation. In parallel FPGA technology is massively blossomed to come very popular and to be a target technology of many applications, in particular of Discrete Wavelet Packet Transform (DWPT).
Although there are tons of research elsewhere, the talking of efficient hardware implementation of wavelet transform is still a complex mission and depend directly on the target application. Where in each application, there is a compromise between the different constraints: processing speed, implementation cost, and power consumption.
1.1 Related works
Since 1980, the crucial date of the born of “Wavelet Transform (WT)” with its founder J. Morlet, we found many works describe the hardware implementations of wavelet transforms. We note that the first work was done by Vishwanath Denk and Parhi , the authors propose an orthonormal DWT architecture combine a digit-serial processing technique with a lattice structure of quadratic mirror filter (QMF). After that, Vishwanath  and Motra , describe an efficient hardware implementation for DWT and Inverse DWT (IDWT). In 2001, Hatem et al.  worked in the reducing of the number of multipliers in the filters structure in a mixed parallel/sequential DWT architecture.
Wu and Hu  describe an implementation of DWPT/IDWPT in a strategy to minimize the number of multipliers and adders in symmetric filters using Embedded Instruction Codes (EIC). In other way to improve the data processing of DWT, Jing and Bin  implement the architecture on FPGA based on advanced distributed arithmetic (IDA), while Wu and Wang  used a multi-stage pipeline structure, although Palero et al.  work on the implementation of two-dimensional DWT architecture. Also, Hu and Jong  present two-dimensional DWT based on lifting scheme architecture that ensure a high throughput data processing.
Based on lifting scheme architecture, Fatemi and Bolouki  describe a pipeline and programmable DWPT architecture. Other important work, to optimize the hardware complexity of DWT based on coextensive distributive computation developed by Sowmya and Mathew . Paya et al.  used a classical recursive pyramid algorithm (RPA) and polyphase decomposition to develop a new architecture for IDWPT based on the lifting scheme. Acharya  developed a systolic architecture for both DWPT/IDWPT with a fixed number of requirement pages. Farahani and Eshghi  described a new DWPT implementation based on a word-serial pipeline architecture and on parallel FIR filters banks. Sarah et al.  presented a convolution block suitable for DWT decomposition. Radhakrishnana and Themozhib  developed a new DWT architecture by using XOR-MUX adders and Truncations multipliers instead of the conventional adders and multipliers. Taha et al.  developed a parallel execution to perform Lifting Wavelet transform implementation with real time, while Shaaban Ibraheem et al.  presented a high throughput parallel DWT hardware architecture based on pipelined parallel processing of direct memory access (DMA).
Also, we have to notice that we found recently some orientation to software approach to compute DWPT/IDWPT on parallel processes to increase the data processing speed with optimization of the distributed computation. But the problem is still the required computing resources (concurrent network processors or processor cores) while the energy consumption is one of the critical criteria in most application domains, for that we do not include it in our bibliography.
1.2 Wavelet theory
In the previous work, we present a detailed review of the wavelet theory. Where we focus here on:
the discrete wavelet packed transform which know as first generation of Discrete Wavelet Transform based on Mallat algorithm 
the lifting scheme approach which know as first generation of Discrete Wavelet Transform based on
1.2.1 Review of DWPT and IDWPT
From the definition of wavelet theory, the DWPT and IDWPT of a signal is set of approximation coefficients and detailed coefficients based on Mallat algorithm (or Mallat tree) and using FIR bank filter and inversely.
Based on Mallat the DWPT transform can be presented like decomposition, as shown in Figure 1.
Where the input signal is presented by the coefficients in level zero with data sampling . This amount of data (input signal) will be decomposed into two part:
High frequency signal presented by approximation coefficient with half data sampling of original signal () by using low-pass filter and down sampling by a factor of two.
Low frequency signal presented by detailed coefficient with half data sampling of original signal () by using high-pass filter and down sampling by a factor of two.
Then the data path will be following the same processing in the next level with the same filters’ characteristics. The depth in Mallat tree algorithm is equal to the number level and describe the needed filters equal to in each level. In general, the corresponding approximation and detailed coefficients in different levels in Figure 1 are calculated as follows:
Where presents the level and .
As proposed Mallat in , the corresponding transfer functions of and are derived in the following equations:
where indicates the delay for one and is the order of filters depends on the used mother wavelet.
In inverse way, the reconstruction of signal or IDWPT without loss of information is possible based on two important properties of wavelets: admissibility and regularity. Similar to decomposition way, the reconstruction operation is following an iterative method and the corresponding coefficients in different levels are calculated as follows:
For example, the reconstruction of signal in three level based always on Mallat algorithm is presented in Figure 2.
Where and are the conjugated low-pass and high-pass of and . Mallat used the quadratic mirror filter (QMF) of corresponding transfer functions and to ensure the perfect reconstruction of the original signal.
1.2.2 Review of lifting scheme discrete wavelet transform
Based on the wavelet theory, we can consider that the lifting wavelet theory is the second generation of DWPT. The strategy in this generation is to reduce the impact of the high pass and the low pass filters by replacing it into a sequence of smaller filters: update filters and predict filters. Therefore, the convolution computations are reduced by comparison to the first generation which naturally reduce the design complexity by maintaining the same quality and speed.
By definition, the lifting wavelet transform is dividing to three steps: Split, Lifting, and Scaling, as shown in Figure 3.
In the split steps, the input signal
1.3 Contributions and work organization
In this work, our goal is to develop a high performance, low cost implementation and configurable new hardware architecture of discrete wavelet transform based on Mallat algorithm : first generation (based on Discrete Packet Wavelet Transform - DWPT) and second generation (lifting scheme Discrete Wavelet Transform) by exploitation of this suitable FPGAs environment. In order to provide the low hardware cost and the high processing speed by design, we develop a new generic parallel-pipeline architecture avoiding the complexity of the traditional architectures with the massif need for hardware resource by: i) intelligent sharing of hardware computing resources (multipliers and adders) among the different filters and stages, ii) design a linear architecture to limited impact of filter and wavelet order. To improve the high performance (data processing speed and hardware cost) of our proposal, we will perform different simulation function of selected wavelet family, transformation depth, filtering order and coefficient quantization. In VHDL at the RTL level, we modeled our architectures and we synthesized it using Altera Quartus Prime Lite, targeting an Intel/Altera Cyclone-V FPGA.
This work is organized as follows: in Section 2, we introduce our linear non-parallel and P-parallel architecture of first generation for both the DWPT and IDWPT along with simulation results. In Section 3, our linear non-parallel architecture for second generation based on lifting scheme is described. Finally, conclusion is given in Section 4.
2. Hardware implementation of first generation
As shown in Figure 1, we notice that in a given stage , each filter proceeds the same amount of data and half data rate by comparison of filter in the adjacent level . The number of needed filters (low-pass and high-pass filters) in a given level is . Furthermore, the amount of procced data in each level is the same.
So, the tree architecture of Mallat have a big regularity of the behavior of filters in different levels. Which leads us to develop an ultra high speed data processing with low hardware consumption (this constraint is critical in modern application that need high throughput with low power consumption). To achieve that we think to develop an evolving architecture by retransform the exponential tree to linear one, as shown in Figure 4.
A high throughput rate with lower hardware resources are provided in this architecture by linearization of classic Mallat tree and parallelization the used transposed FIR filter. To achieve our goal by minimizing the hardware consumption, we proposed a shared computational resource (multipliers and adders) between the low-pass and high pass filters as shown in Figure 5.
In this structure, we propose a modified transposed FIR filter corresponding to blocks in Figure 4, this model is look like the serial FIR filter in the theory of FEC coding. The blocks can process in parallel P inputs sampling (signals) and consequently P outputs sampling (signals) in each clock cycle and consequently the P-parallel DWPT (Figure 4) are able to transform
Furthermore, this architecture is suitable for all wavelet family where we need just to change the coefficients of high-pass and low-pass for each family. Where the data handling (filter coefficients or signal sampling) of the low-pass and high-pass filter between different stages is dedicated to specific block in our architecture; we called it “buffers block”. The main role of “buffers block” is to interleaving data from stage to the next stage and to manage data between low-pass and high-pass filter in the same stage. Their structure is detailed in Figure 6.
To procced the same amount of data in the original Mallat binary tree b (of course multiplied by the degree of parallelize ), the buffer blocks should be working with this mechanism:
The parameter describe the stage, change from 1 to max depth of wavelet transform. The parameter present the degree of parallelism and must respect the dyadic rule, .
The structure of buffer block is based on the concept of manipulation of data speed transfer in register level, where we built up inside sub-blocks, each block has two registers/buffers level speed: “Fast Buffer” and “Slow Buffer”. On each clock cycle, the Fast Buffer take data from the output of previous stage and achieve P-shift. While, Slow Buffer sub-blocks are powered to take data from the Fast Buffer registers of the same stage and then achieve P-shift on two-clock cycles.
The size Buffer blocks (number of fast registers and slow registers) depends on two parameters: the stage presented by and the parallel degree presented by .
To manage the data path between “Slow Buffer” and “Fast Buffer”, we specified two control signal: “and “, where the signal (in green) is dedicated to control the shift rate between the different registers in Slow Buffers sub-blocks and “signal (in red) is to manage the data transfer from the fast buffer sub-block to the slow buffer sub-block. Technically these two control signals give the permission to transfer all data from the registers of the Fast Buffer to the Slow Buffer registers simultaneously in each clock cycle (in a given stage ).
The operation in the “
To centralize the architecture, we developed a control block or control unit to manage all control signals in different stage, as shown in Figure 7.
As a reverse way of P-parallel DWPT transform, this section is dedicated to present our proposed model of P-parallel IDWPT.
As we mention in the section of P-parallel DWPT transform, the reconstruction process has also a big regularity, where in Figure 2, we notice that each filter proceeds the same amount of data and half data rate by comparison to filter in the adjacent level. The number of needed filters (low-pass and high-pass filters) in a given level
We introduce the concept of linearize and serialize in our pipeline and P-parallel architecture to eliminate the impact of exponential evolution of the number of used filters. So, as shown in Figure 8, we develop a novel architecture.
In this architecture, in each stage we implement only one modified filter instead of using low pass filters and high pass filters. It is important to mention that the number modified transposed FIR filter bank increased linearly as a function of depth order which it was exponential in the classic architecture.
To achieve our goal by minimizing the hardware consumption, we develop a Blocks Filter /which is a modified reconstruction P-parallel FIR filters by shared computational resource (multipliers and adders) between the low-pass and high pass filters and with a similar structure of that present in Figure 4. The only difference is the coefficients filters values. Consequently, the P-parallel FIR filter is able to filter sampling in each clock cycle.
The data manage and interleaving between filters from the first to the end stages is dedicated to the buffer block. Their structure is detailed in Figure 9.
To ensure the data management between reconstruction high-pass and low-pass filter, we play on the timing of buffer register: slow buffer and fast buffer. To procced the same amount of data in the original Mallat binary tree b (of course multiplied by the degree of parallelize ), the buffer blocks should be working with this same mechanism as that used in the previous section.
The fast bufferachieve P-shift on each clock cycle while the “slow buffer” achieve P-shift on two-clock cycles. To manage the data follow path between “Slow Buffer” and “Fast Buffer”, we specified two control signals: “and “. The signal (in green) is dedicated to control the shift rate between the different registers in Slow Buffers sub-blocks and the “signal (in red) is to manage the data transfer from the fast buffer sub-block to the slow buffer sub-block. Also, we used a control block unit to manage the control signals. The structure of control block is similar to that present in Figure 7.
2.3 Implementation results
Following our strategy, we develop a new pipeline and P-parallel architectures for DWPT and IDWPT. These architectures are full reconfigurable at synthesis. The reconfigurable parameters are the wavelet scale or the depth of DWPT and IDWPT, the filter coefficient and data quantization, the order of modified and filters, and the degree of parallelism.
Also, these architectures are partially reconfigurable after synthesis function the value of filters coefficients (that mean implicitly the order of filters). This feature, we give the possibility to work with different wavelet family without re-synthesis the FPGA carte where we load dynamically after synthesis the filter coefficients of the corresponding wavelet.
Our aim in this part is to study the performance of these architecture to record the impact of different parameters on:
The speed of data process that mean the clock frequency (given in MHz) of implemented architecture. Where from the degree of parallelism and clock frequency, we can obtain the data sampling rate of our DWPT and IDWPT architectures.
The hardware consumption, which represented the logic registers and the logic elements .
In the following procedure of the implementation of our new architectures of DWPT and IDWPT transforms on the same FPGA split, we respect these constraints:
These architectures (pipeline and P-parallel DWPT and IDWPT) are designed and modeled in VHDL at the RTL level.
Theoretically, we do not have a limitation of parallelism degree but we should take into consideration the exist technology (hardware side) and the value must respect the dyadic rule, i. e. .
We used Altera Quartus software premium lite edition to synthesis our architectures and Intel/Altera Cyclone-V FPGA as a target technology with a speed grade of −7. For the real implementation, we used an FPGA board from Xilinx product called NexysVideo development board based on Artix-7 FPGA as a target technology.
2.3.1 Real implementation setup
To evaluate the proposed solution, a real implementation setup is depicted in Figure 10, where we used the UART connector to send and receive data from PC to NexysVideo board and inversely. Initial verification has been realized by sending the coefficients of Low-pass and High-pass filter after synthesis. Additional verification has been realized when received the reconstructed data.
|Design parameters (Depth, Filter order, and Quantification)||Clock frequency (MHz)||Resources usage ()|
|(2, 2, 5)||203.8||(471,296)|
|(3, 2, 5)||200.21||(756,510)|
|(4, 2, 5)||197.37||(1204,899)|
|(2, 4, 5)||200.87||(879,456)|
|(3, 4, 5)||185.05||(1299,719)|
|(4, 4, 5)||193.71||(1941,1171)|
|(2, 16, 5)||189.2||(3299,1416)|
|(3, 16, 5)||192.3||(4794,1924)|
|(4, 16, 5)||185.08||(6397,2614)|
|(2, 2, 16)||122.62||(2571,905)|
|(3, 2, 16)||119.79||(4216,1599)|
|(4, 2, 16)||123.14||(5850,2853)|
|(2, 4, 16)||120.56||(5038,1324)|
|(3, 4, 16)||118.57||(7521,2260)|
|(4, 4, 16)||115.33||(10,374,3636)|
|(2, 16, 16)||114.16||(4902,4402)|
|(3, 16, 16)||126.16||(6805,5729)|
|(4, 16, 16)||124.23||(9107,7752)|
|Design parameters (Depth, Filter order, and Quantification)||Clock frequency (MHz)||Resources usage ()|
|(2, 2, 5)||217.31||(1109,504)|
|(3, 2, 5)||212.45||(1699,935)|
|(4, 2, 5)||213.4||(2754,1531)|
|(2, 4, 5)||217.9||(2120,897)|
|(3, 4, 5)||202.6||(3050,1197)|
|(4, 4, 5)||206.59||(4603,2023)|
|(2, 16, 5)||201.14||(7689,2447)|
|(3, 16, 5)||202.16||(12,176,3166)|
|(4, 16, 5)||196.82||(14,956,4571)|
|(2, 2, 16)||95.77||(6079,1696)|
|(3, 2, 16)||97.8||(9279,2735)|
|(4, 2, 16)||98.82||(13,489,5011)|
|(2, 4, 16)||97.04||(12,032,2582)|
|(3, 4, 16)||94.2||(17,549,3965)|
|(4, 4, 16)||88.01||(24,311,6363)|
|(2, 16, 16)||99.15||(11,263,7856)|
|(3, 16, 16)||102.89||(14,750,11,451)|
|(4, 16, 16)||100.24||(21,314,13,091)|
|Design parameters (Depth, Filter order, and Quantification)||Clock frequency (MHz)||Resources usage ()|
|(2, 2, 5)||210.36||(3668, 652)|
|(3, 2, 5)||209.23||(6019, 960)|
|(4, 2, 5)||209.02||(8655, 1243)|
|(2, 4, 5)||181.83||(5991, 1689)|
|(3, 4, 5)||178.58||(8380, 2363)|
|(4, 4, 5)||178.37||(11,181, 2601)|
|(2, 16, 5)||169.1||(30,012, 4881)|
|(3, 16, 5)||167.28||(37,172, 6575)|
|(4, 16, 5)||167.43||(38,374, 7680)|
|(2, 2, 16)||106.07||(11,116, 3395)|
|(3, 2, 16)||105.11||(17,679, 4330)|
|(4, 2, 16)||104.98||(25,389, 4830)|
|(2, 4, 16)||91.2||(31,336, 5137)|
|(3, 4, 16)||90.55||(39,361, 7764)|
|(4, 4, 16)||91.85||(42,687, 10,342)|
|(2, 16, 16)||86.43||(26,408, 15,572)|
|(3, 16, 16)||86.57||(33,859, 20,959)|
|(4, 16,16)||85.9||(36,348, 24,456)|
Based on the results in Tables 1–3, we observed that when we increase the quantization order from 5 to 16 this increases linearly the logic and element registers and decreases logarithmically the clock frequency from around 200 MHz to around 100 MHz. As expected, the impact of depth and order of filters is too weak on the clock frequency and increases linearly the logic and element registers while it was exponential with Mallat binary tree. It is important to notice that the small latency in our architectures give us the possibility to process data in ultra high speed (in the gate of Giga-samples/clock cycle) without requiring any extra memory or DSP blocks.
It is important to notice that the incrementation of the functional frequency is directly proportional to the parallel degree. When we exceed the order of parallelism to 32, the needed resources overcome the capacity of NexysVideo board. To vanquish this problem, we suggest two possible solutions:
Under the strategy of minimizing the used hardware of Discrete Wavelet Transform, we look forward to the lifting scheme wavelet transform as a second DWPT transform generation. Section 3 is dictated to describe in details this suggested proposal.
Another possible solution is to upgrade this work with a new FPGA family like Ultra scale. This new FPGA architecture present a high-performance environment which deliver the optimal balance between the required system performance (with 783 k to 5541 k Logic cells) and the smallest power envelope. But it remains a very expensive solution.
3. Hardware implementation of second generation
Actually, we find that some new applications especially in modern wireless communication require high throughput but at the same time a low energy consumption. For this reason, we look for the second discrete wavelet generation or lifting scheme wavelet transform. Because the lifting wavelet theory is by nature require less multiplier/adder blocks and consequently low energy.
So, our aim in this section is to conserve the ultra high speed data process and also reduce the hardware conception by introducing the linearization concept in the classic lifting scheme DWPT and IDWPT tree as shown in Figures 11 and 12.
A new pipeline and linear lifting scheme DWPT and IDWPT architectures are presented in Figures 11 and 12. These new architectures ensure the data speed proceed like the classic lifting scheme transform but with less hardware not affected by the wavelet depth.
The Filter Blocks and Filter Blocks in linear lifting scheme of DWPT and IDWPT architectures, receptively, are the modified predicted and updated filter. In Figure 13, we present the structure of the modified Filter Blocks (the same for Filter Blocks, we just change the coefficients values) which can process the same amount of data (same functionality) on given stage in the classic lighting scheme tree. This Filter Blocks can process two samplings in one clock cycle.
To evaluate the performance of our architecture, a comparison section is important to prove the potential of our work and to lead us to a new innovated architecture.
In Table 4, we present a comparison between our proposed architectures and other achieved architectures of Discrete Wavelet Transform in literature. Without doubt, this table presents the potential of our linear pipeline and parallel architecture where on one hand it ensures a high frequency data processing and on the other hand a full reconfigurable structure using less hardware. Additionally, without missing an important feature, we implemented our architecture without using a memory or DSP blocks which gives us a privilege to more optimization of the used hardware in the next FPGA generation.
|Sung et al. ||Marino et al. ||Mohanty et al. ||Madis-hetty et al. ||Wang et al. ||Wu et al. ||Meihua et al. ||Proposed parallel architecture|
|N/A||N/A||426||1040||N/A||30,192 (Logic element)||1835 (Logic element)||3668 (Logic element)|
|Xilinx XC2V4000||N/A||CMOS 90 nm||Xilinx Virtex 6||CMOS 180 nm||0.35 μm||Altera EP20K200E||Altera Cyclone IV & V|
|N/A||N/A||Feq.: 20||Feq.: 306.15||Feq.: 20||Feq.: 100||Feq.: 29||Bitrate:718.13 Feq.: 197.47|
|N/A||N/A||N/A||N/A||32||N/A||Test with 5 to 16 (up to the limit of manufacturing technology)|
|3||N/A||N/A||4||3||3 (up to 6)||3||2, 3 and 4 (up to the limit of manufacturing technology)|
|No||No||No||No||No||No||No||4,8 and 16(up to the limit of manufacturing technology)|
In this work, we propose ultra-high throughput with low hardware consumption of first generation and second generation of discrete wavelet packet transform. Where title of example, from Table 3, with a quantization order = 5, depth order = 2, filter order = 2 and degree of parallelism
Based on the results in Tables 1–3, these architectures ensure high operating frequency which is low affected of wavelet depth and filters order because in our structures we maintained a short critical path of effective data path. Furthermore, these architectures are pipelined and P-parallel, modeled in VHDL at the RTL level, generic and fully reconfigurable in pre-synthesis function of the quantization of the filter coefficients and data sampling, the depth of wavelet transform, the order of the filters, and the degree of parallelism.
Last, but not least, our developed architectures are reconfigurable post-synthesis, which is not the case for most of the previous work as shown in the comparison in Table 4. Where the values of filters coefficients can be load at run-time which provides a great flexibility in experimental usage in contrary to all previous works.
This work is still in progress where we are making many simulations/verifications in different contexts to verify if the simulation results will agree or not with the implementation results. As perspectives, we work on new version of FIR filter and in parallel another work to create an IP core (Intellectual Property core) FIR to be used with different FPGA boards and in different applications. A natural way of this work is to develop a different parallel version of hardware implementation in FPGA of lifting scheme wavelet transform.