The parameters of lifting scheme for db4 (8 taps)

## 1. Introduction

The discrete wavelet packet transform (DWPT) as a generalization of the standard wavelet transform provides a more flexible choices for time–frequency (time-scale) representation of signals [1] in many applications, such as the design of cost-effective real-time multimedia systems and high quality audio transmission and storage. In parallel to the definition of the ISO/MPEG standards, several audio coding algorithms have been proposed that use the DWPT, in particular, adaptive wavelet packet transform, as the tool to decompose the signal [2],[3]. In practice, DWPT are often implemented using a tree-structured filter bank [2], [3],[4]. The DWPT is a set of transformations that admits any type of tree-structured filter bank, that provides a different time–frequency tiling map. Many architectures have been proposed for computing the discreate wavlet transform in the past. However, it is not the case for the DWPT. There are very few papers regarding the development of specific architectures for the DWPT. In [5] is designed a programmable DWPT processor using two-buffer memory system and a single multiplier–accumulator (MAC) to calculate different subbands. Method [6] exploits the in-place nature of the DWPT algorithm and uses a single processing element consisting of multipliers in parallel and adders for each low-pass and high-pass filters – wavelet butterflies (is the number of filter taps) to increase the throughput. In [7] is also proposed a folded pipelined architecture to speed up the throughput. It consists of MACs communicated by memory banks to compute each level of the total decomposition levels.

Applying the lifting scheme [8] for the construction of wavelets filter bank allows significantly reduce the number of arithmetic operations that are necessary to compute the transform. A folded parallel architecture for lifting-based DWPT was presented in [9]. It consists of a group of MACs operating in parallel on the data prestored in a memory bank. In [10] is proposed an architecture using a direct implementation of a lifting-based wavelet filter to perform one level of DWPT at a time. The main drawback of these existing architectures is that they all use memory to store the intermediate coefficients and involve intense memory access during the computation. A recursive pyramid algorithm based folded architecture for computing lifting-based multilevel DWPT is presented in [11]. However, the scheduling and control complexity is high, which also introduce large numbers of switches, multiplexers and control signals. The architecture is not regular and need to be modified for different number of level of DWPT computation. A folded architecture for lifting-based wavelet filters is proposed in [12] to compute the wavelet butterflies in different groups simultaneously at each decomposition level. According to the comparison results, the proposed architecture is more efficient than the previous proposed architectures in terms of memory access, hardware regularity and simplicity, and throughput. It is necessary to notice, that the architecture of the given processor is effective only for calculation of full tree DWPT. Here there is no technique of management for DWPT with best tree searching.

Algorithm transformation techniques have been employed in high-speed DSP system design is presented in [13]. All of the above mentioned techniques are applied during the processor design phase and their implementation is time invariant. Therefore, this class of signal processing techniques is referred as static techniques. Recently, dynamic techniques both of the circuit level and algorithmic level have been proposed [14]. These techniques are based on the principles that the input signal is usually non-stationary, and hence, it is better (from a coding perspective) to adapt the algorithm and architecture to the input signal. Such systems are referred to as reconfigurable signal processing systems [15],[16]. The key goal of these techniques is to improve the algorithm performance by exploiting variability in the data and channel.

Our approach is to design of dynamic algorithm transform (DAT) for design of application-specific reconfigurable lifting-based DWPT pipeline processor, in particular, for audio signal processing in real-time. The principle behind DAT techniques is to define parameter of input audio signals (subband entropy) and output encoded sequences (subband rate) for the given embedded processor architecture. Adaptive wavelet analysis for audio signal processing purposes is particularly interesting if the psychoacoustic information is considered in the DWPT decomposition scale. Due to the lack of selectivity of wavelet filter banks, the psychoacoustic information is computed in the wavelet domain.

## 2. Flexible tree structured signal expansion based on DWPT

DWPT algorithm is a generalization of the discrete wavelet transform that can be represented as a filter bank with a tree structure [3] (see figure 1). Within a given node number

Thus, a specific node

## 3. Dynamic transformation of DWPT decomposition

We present adaptive DWPT tree derived via DAT’s. The principle behind DAT is to define parameter of input signals (subband entropy) and output sequences (subband rate) for the given embedded processor architecture. In other hands, DAT techniques is to construct a minimum cost subband decomposition of DWPT by maximizing the minimum masking threshold (which is limited by the perceptual entropy (

where

here

The growing decision for DWPT tree based on the given

If (3) is true we continue the subband splitting process in DWPT tree, otherwise the suboptimal decomposition for the given frame of signal is founded.

The subband splitting process is managed based on the estimated values of

where

Each allowed parent node

Schematically the DWPT tree growing process is shown in the figure 3. The example of dynamic DWPT tree structure growing level by level based on

Applying the information density

## 4. DWPT implementation based on lifting scheme

### 4.1. Factoring wavelet filters in to lifting steps

In the tree-based scheme of the DWPT, each node of the tree consists of a two-channel filter bank. Each node can be broken down into a finite sequence of simple filtering steps, which are called lifting steps or ladder structures. In [20] for two-channel filter bank proposed a method of transition from the implementation on the basis of FIR filters to architecture at the based on lifting scheme. The decomposition is essentially a factorization of the polyphase matrix of the wavelet filters into elementary matrices. As discussed in [20], the lifting steps scheme consists of three phases: the first step splits the data into two subsets: even and odd; the second step recalculates the coefficients (high-pass) as the failure to predict the odd set based on the even; finally the third step updates the even set using the wavelet coefficients to compute the scaling function coefficients (low-pass). This method allows to reduce on halve the number of multiplications and summations. In terms of the

The first step is to move towards the implementation of polyphase filtering algorithm [20]. The process of calculating the LF and HF components of the signal

where

The elements

This approach does not give the gain to the computational cost, but the hardware implementation allows us to reduce the operation frequency in double in compare with input data rate due to parallel computation (see figure 5).

3.1029 | -5.1995 | 0.3141 | 0.0763 | -3.1769 | |

0 | 1.6625 | 0 | -0.2920 | -0.0379 | |

0 | -1 | -3 | 1 | 3 |

The second step is a factorization of the polyphase matrix into simpler triangular matrices. The result is, in a general, the original matrix

where

The scaling

where

The inverse decomposition of a two-channel filter bank in the same terms can be expressed as

and corresponding block diagram of two-channel synthesis filter bank based on the lifting scheme shown in a figure 7.

According to the block diagram (see figure 7), the synthesis procedure is implemented as follows: at first, the input coefficients

Together the analysis and synthesis filter bank implementation based on lifting scheme is required the same number of operations (summation and multiplication) in compare with analysis only implementation based on regular FIR filter implementation.

### 4.2. The algorithm implementation based on fixed-point variable format arithmetic

A number of the target application requirements (work in real time, greater throughput, and other) make it necessary to use fixed-point arithmetic to perform the specified computation. With the implementation of two-channel filter bank based on lifting structures using integer arithmetic, the number of difficulties arise, related to the fact that values of the coefficients of the polynomials

In accordance with this approach, any number represented in two's complement fixed-point format, is given in the form of expression:

Here,

For a given in (13) format, the operations of addition and multiplication of a and b (

The figure 9 schematically illustrates the process of performing operations described above in (14) and (15).

In this paper a generic set of processing elements is proposed to implement of analysis and synthesis banks on the lifting structures using variable arithmetic format (see table 2). In this table

Symbol | Computational operations | Symbol | Computational operations |

For an explanation of the practical aspects related to the use of the proposed arithmetic, below an example of the two-channel analysis bank realization on the mother wavelet function db4 [23] is considered. In result of polyphase matrix factorization for a given wavelet basis in the MATLAB environment the following expression was obtained:

The coefficients

1 | 0 | -3,1029 | -0,7757 | 2 | 0 | 0 | - | |

2 | 1 | -0,0763 | -0,6104 | -3 | 0,2920 | 0,5840 | -1 | |

3 | -1 | 5,1995 | 0,6499 | 3 | -1,6625 | -0,8313 | 1 | |

4 | 3 | 3,1769 | 0,7942 | 2 | 0,0379 | 0,6064 | -4 | |

5 | -3 | 0,3141 | 0,6282 | -1 | 0 | 0 | - |

In the figure 10 shown a block diagram of the implementation of two-channel filter bank analysis for this example. In this scheme, apart from computing elements

In the figure 11a in more detail the first step realization of the analysis bank is considered and in the figure 11b the last step realization of synthesis bank is shown.

As can be seen from the figure 10 and figure 11, the computing units of analysis and synthesis procedures in terms of implementation differ only in the signs of constant multiplying coefficients and the arithmetic shifts positions and directions.

Based on the materials described above, a concrete realization of two-channel bank can be represented as a vector of parameters containing a set of multiplier constants, shift parameters and some additional information regarding the delay elements in the intermediate nodes of the algorithm.

### 4.3. Accuracy analysis of the algorithm for fixed-point variable format

To analyse of the proposed approach in MATLAB function library was written, which simulates the process of fixed-point calculating in filter bank with specified structure of the tree. In the figure 12 is shown the estimation error variance signal recovery, depending on the choice of bit internal registers as a result of passing through the two-channel filter bank analysis / synthesis (in the example used wavelet filters db8). This figure also shows the results of an experiment using FIR filters, the underlying of the algorithm DWPT. It can be noted that FIR filter implementation gives better results while using the same registers word length, but requires twice as many calculations. So in order to achieve the level of error variance in the -70 dB for based on lifting structures implementation requires 16-bit, which are approximately two bits more than the realization based on FIR. But this drawback is compensated by a significant reduction of arithmetic operations compared to the direct implementation. Thus, we conclude that the proposed approach is more efficient in hardware implementation compared with the bank on the basis of FIR filters.

Below we consider another experiment for demonstration of the energy localization properties by using our fixed-point DWPT algorithm realization. For this a polyharmonic signal was generated and passed through a five-level decomposition tree fast wavelet transform (division of the tree is carried out only in the low-frequency components). As an example, the wavelet functions db2 family was chosen. Thus all range of amplitudes of the wavelet coefficients was divided by 40 thresholds (these values are plotted on the X-axis of figure 13). Each threshold has been mapped to a vector of the obtained analysis wavelet coefficients on condition that these coefficients are greater than this threshold. Otherwise, the values of the coefficients were replaced by zeros (i.e. in each vector were discarded unimportant relative to a given threshold values). For all vectors was performed reconstruction procedure by synthesis filter bank. In figure 13a, figure 13b for the floating-and fixed-point implementations, respectively, are shown: solid line – the reconstructed to original signal energy relation (in percentage) depending on the threshold values; dotted line – the percentage of "discarded" wavelet coefficients depending on the chosen threshold.

Based on these results, we note almost complete compliance of floating-point model with the proposed fixed-point approach. Thus, the fixed-point variable format algorithm DWPT implementation preserves the energy localization inherent to the wavelet packets.

## 5. DWPT pipeline processor with dynamic reconfigurable architecture

### 5.1. DAT based reconfigurable signal processing system

The structure of reconfigurable DSP system for signal analysis based on DAT approach consists of the specific microprocessor oriented on the signal processing (DSP microprocessor) and DWPT processor itself with the reconfigurable architecture. The DSP microprocessor perform several task, such as: processing wavelet coefficients

The pipeline architecture is applied for effective implementation of the DWPT algorithm. We suggest the pipeline architecture for constructing the DWPT lifting based processor. This architecture integrates the sequential connection of the homogeneous block (buffer/switch unit (BSU) and processing unit (PU)) that implement a two-channel filter bank that allows parallel calculating DWPT with an arbitrary tree structure. The maximum number of decomposition level that can be realize is 8, it is associate with the depth of

### 5.2. DWPT lifting based pipeline Processing Unit (PU)

The block diagram of PU is shown in the figure 15. The input sequence

+2 | |

N | |

+1 | |

1 |

### 5.3. Buffer/Switch Unit (BSU)

The BSU realizes double buffering scheme known as “ping-pong” for providing parallel access to the data for storing results and getting source data from/for PU. The additional channel is for outputting the result data. The two output streams of samples

The momory amount

where * K*is a initial frame length of input signal.

In the figure 17 is shown an example of the tree decomposition, given by the set of nodes {(0,0), (1,0), (1,1), (2,0), (2,1), (2,2), (2.3), (3,0), (3,1)}. It also schematically illustrates the principle of distribution of blocks of memory for the structure of the tree.

### 5.4. Rapid prototyping algorithm of pipeline DWPT processor

The prototype of the DWPT processor can be specified as parameters describing the structure of two-channel filter bank and the vector that defines the limit tree decomposition.

The method of rapid prototyping can be described by the following sequence of actions.

Calculating the lifting structure of the dual filter bank based on the original wavelet basis functions.

Translating the mathematical model for fixed-point arithmetic with the requirements of accuracy, and limitation of hardware resources (registers and bit computing units).

Forming a parameters vector for configure a DWPT processor prototype.

Estimating the cost of hardware prototype implementation.

Estimating computation characteristics of the DWPT procressor prototype.

Generating the output files of the synthesized VHDL-description of the DWPT processor.

### 5.5. FPGA based hardware implementation of the pipeline DWPT processor

For estimation of performance and resource utilization the present architecture has been implemented on Xilinx FPGA XC3s2000. The realized pipeline DWPT processor has following features. The number of decomposition levels is limited by eight. The mother wavelet function Db8 (16 taps) transformed into nine lifting steps is used. The input and output data has the 16 bits word length, the capacity of internal computing is 18 bits. The present implementation doesn’t have FIFO stages in PU that allows minimizing hardware resources. The processed frame size can be selected in a range from 128 to 1024 samples. Each BSU contains the pair of two 1024×16 bits block RAMs that is used for realizing double buffering scheme. The PU hardware resources utilization are shown in a table 5 and complete processor implementation resources are presented in table 6.

788 | |

226 | |

18 |

31356 | 76 | |

3037 | 7 | |

16 | 40 | |

40 | 100 |

In the figure 18 the protopyte board of dynamic reconfigurable pipeline DWPT processor is shown.

The implemented design performance is 8 MSPS. So, if the sample rate of input audio signal is 44100 Hz then the time cost for computation of wavelet coefficients is 0.6% from all time resource. For example, the 512 samples frame size (~11.6 ms) processing take approximately 0.064 ms on presented DWPT lifting based pipeline processor. The rest time is distributed between the dynamic DWPT tree decomposition algorithms, wavelet coefficients post-processing and transfer operation.

### 5.6. DAT based dynamic reconfigurable architecture algorithm

Suppose that for some audio input frame is the space of trees structures

where

Parameters

In turn, a group of parameters

Thus, the transition of signal processing according to the DWPT tree structure

From basic principles of psychoacoustics follows that human perception of acoustic information is quite inert, from 5 ms to 300 ms. Masking forward and backward is approximately 20 ms. With a input audio signal frame length of 5 ms. and processing delay determined by a single stage parallel pipeline processor, we can assume that the delay in processing the input signal

The profile of the time parameters

Thus, the DWPT trees structure

for 1^{st} frame:

for 2^{nd} frame:

for 3^{rd} frame

This algorithm of dynamic reconfiguration allows obtaining a suboptimal solution for DWPT analysis. The advantages of the above algorithm can be summarized as: pruning method is a top-down method, DWPT pruning can be viewed as a split process, i.e. we have the temporal construction DWPT tree for each signal frame that is ideal decision for real time processing implemented in a reconfigurable hardware.

The processing of the first nine frames in the pipeline DWPT processor is shown in the figure 22 where

The complete input signal analysis in DWPT processor is demonstrated in the figure 23. The input signal is segmented on a frame with minimal overlapping and analysed. The frame length in a time is equal 22.3 ms. At the same time for frames to be processed under the current structure of the tree

The output signal synthesis in DWPT processor is demonstrated in the Figure 27. The monitoring system loads the input frame

## 6. Conclusion

In the given paper the dynamic reconfigurable lifting based adaptive DWPT processor was presented. The lifting scheme allows to reduce on halve the number of multiplications and summations and increase the processing speed. Appling DAT-based approach as the design techniques for time-varying DWPT decomposition allows us to construct dynamically adapted to input signal DWPT analysis. The reconfigurable system offers several advantages over competing alternatives: faster and smaller than general purpose hardware solutions; lower development cost than dedicated hardware solutions; dynamic reconfigurable supports multiple algorithms within a single application; multi-purpose architecture generates volume demand for a single hardware design. The proposed techniques optimize system performance and, in addition, provide a convenient framework within which on-going research in the areas of non-uniform filter bank applied to speech/audio coding algorithms and reconfigurable architectures can be synergistically combined to enable the design of reconfigurable high-performance DSP systems.

Thus, the proposed dynamic reconfigurable DWPT processor with frame-based psychoacoustic optimized time-frequency tilling is successfully applicable for several application such as monophonic full-duplex audio coding system [18] and scalable audio coding based on hybrid signal decomposition where the transient part of the signal is modelled on psychoacoustic motivated frame based adaptive DWPT in marching pursuit algorithm [24]. The advantages of this DWPT processor is better viewed by considering the DWPT growing as a splitting process, i.e. the temporal construction DWPT tree created for each signal frame presents an ideal decision for real time processing implemented in a reconfigurable hardware.