1. Introduction
The discrete wavelet packet transform (DWPT) as a generalization of the standard wavelet transform provides a more flexible choices for time–frequency (timescale) representation of signals [1] in many applications, such as the design of costeffective realtime multimedia systems and high quality audio transmission and storage. In parallel to the definition of the ISO/MPEG standards, several audio coding algorithms have been proposed that use the DWPT, in particular, adaptive wavelet packet transform, as the tool to decompose the signal [2],[3]. In practice, DWPT are often implemented using a treestructured filter bank [2], [3],[4]. The DWPT is a set of transformations that admits any type of treestructured filter bank, that provides a different time–frequency tiling map. Many architectures have been proposed for computing the discreate wavlet transform in the past. However, it is not the case for the DWPT. There are very few papers regarding the development of specific architectures for the DWPT. In [5] is designed a programmable DWPT processor using twobuffer memory system and a single multiplier–accumulator (MAC) to calculate different subbands. Method [6] exploits the inplace nature of the DWPT algorithm and uses a single processing element consisting of multipliers in parallel and adders for each lowpass and highpass filters – wavelet butterflies (is the number of filter taps) to increase the throughput. In [7] is also proposed a folded pipelined architecture to speed up the throughput. It consists of MACs communicated by memory banks to compute each level of the total decomposition levels.
Applying the lifting scheme [8] for the construction of wavelets filter bank allows significantly reduce the number of arithmetic operations that are necessary to compute the transform. A folded parallel architecture for liftingbased DWPT was presented in [9]. It consists of a group of MACs operating in parallel on the data prestored in a memory bank. In [10] is proposed an architecture using a direct implementation of a liftingbased wavelet filter to perform one level of DWPT at a time. The main drawback of these existing architectures is that they all use memory to store the intermediate coefficients and involve intense memory access during the computation. A recursive pyramid algorithm based folded architecture for computing liftingbased multilevel DWPT is presented in [11]. However, the scheduling and control complexity is high, which also introduce large numbers of switches, multiplexers and control signals. The architecture is not regular and need to be modified for different number of level of DWPT computation. A folded architecture for liftingbased wavelet filters is proposed in [12] to compute the wavelet butterflies in different groups simultaneously at each decomposition level. According to the comparison results, the proposed architecture is more efficient than the previous proposed architectures in terms of memory access, hardware regularity and simplicity, and throughput. It is necessary to notice, that the architecture of the given processor is effective only for calculation of full tree DWPT. Here there is no technique of management for DWPT with best tree searching.
Algorithm transformation techniques have been employed in highspeed DSP system design is presented in [13]. All of the above mentioned techniques are applied during the processor design phase and their implementation is time invariant. Therefore, this class of signal processing techniques is referred as static techniques. Recently, dynamic techniques both of the circuit level and algorithmic level have been proposed [14]. These techniques are based on the principles that the input signal is usually nonstationary, and hence, it is better (from a coding perspective) to adapt the algorithm and architecture to the input signal. Such systems are referred to as reconfigurable signal processing systems [15],[16]. The key goal of these techniques is to improve the algorithm performance by exploiting variability in the data and channel.
Our approach is to design of dynamic algorithm transform (DAT) for design of applicationspecific reconfigurable liftingbased DWPT pipeline processor, in particular, for audio signal processing in realtime. The principle behind DAT techniques is to define parameter of input audio signals (subband entropy) and output encoded sequences (subband rate) for the given embedded processor architecture. Adaptive wavelet analysis for audio signal processing purposes is particularly interesting if the psychoacoustic information is considered in the DWPT decomposition scale. Due to the lack of selectivity of wavelet filter banks, the psychoacoustic information is computed in the wavelet domain.
2. Flexible tree structured signal expansion based on DWPT
DWPT algorithm is a generalization of the discrete wavelet transform that can be represented as a filter bank with a tree structure [3] (see figure 1). Within a given node number
Thus, a specific node
3. Dynamic transformation of DWPT decomposition
We present adaptive DWPT tree derived via DAT’s. The principle behind DAT is to define parameter of input signals (subband entropy) and output sequences (subband rate) for the given embedded processor architecture. In other hands, DAT techniques is to construct a minimum cost subband decomposition of DWPT by maximizing the minimum masking threshold (which is limited by the perceptual entropy (
where
here
The growing decision for DWPT tree based on the given
If (3) is true we continue the subband splitting process in DWPT tree, otherwise the suboptimal decomposition for the given frame of signal is founded.
The subband splitting process is managed based on the estimated values of
where
Each allowed parent node
Schematically the DWPT tree growing process is shown in the figure 3. The example of dynamic DWPT tree structure growing level by level based on
Applying the information density
4. DWPT implementation based on lifting scheme
4.1. Factoring wavelet filters in to lifting steps
In the treebased scheme of the DWPT, each node of the tree consists of a twochannel filter bank. Each node can be broken down into a finite sequence of simple filtering steps, which are called lifting steps or ladder structures. In [20] for twochannel filter bank proposed a method of transition from the implementation on the basis of FIR filters to architecture at the based on lifting scheme. The decomposition is essentially a factorization of the polyphase matrix of the wavelet filters into elementary matrices. As discussed in [20], the lifting steps scheme consists of three phases: the first step splits the data into two subsets: even and odd; the second step recalculates the coefficients (highpass) as the failure to predict the odd set based on the even; finally the third step updates the even set using the wavelet coefficients to compute the scaling function coefficients (lowpass). This method allows to reduce on halve the number of multiplications and summations. In terms of the
The first step is to move towards the implementation of polyphase filtering algorithm [20]. The process of calculating the LF and HF components of the signal
where
The elements
This approach does not give the gain to the computational cost, but the hardware implementation allows us to reduce the operation frequency in double in compare with input data rate due to parallel computation (see figure 5).




 
 3.1029  5.1995  0.3141  0.0763  3.1769 
 0  1.6625  0  0.2920  0.0379 
 0  1  3  1  3 
The second step is a factorization of the polyphase matrix into simpler triangular matrices. The result is, in a general, the original matrix
where
The scaling
where
The inverse decomposition of a twochannel filter bank in the same terms can be expressed as
and corresponding block diagram of twochannel synthesis filter bank based on the lifting scheme shown in a figure 7.
According to the block diagram (see figure 7), the synthesis procedure is implemented as follows: at first, the input coefficients
Together the analysis and synthesis filter bank implementation based on lifting scheme is required the same number of operations (summation and multiplication) in compare with analysis only implementation based on regular FIR filter implementation.
4.2. The algorithm implementation based on fixedpoint variable format arithmetic
A number of the target application requirements (work in real time, greater throughput, and other) make it necessary to use fixedpoint arithmetic to perform the specified computation. With the implementation of twochannel filter bank based on lifting structures using integer arithmetic, the number of difficulties arise, related to the fact that values of the coefficients of the polynomials
In accordance with this approach, any number represented in two's complement fixedpoint format, is given in the form of expression:
Here,
For a given in (13) format, the operations of addition and multiplication of a and b (
The figure 9 schematically illustrates the process of performing operations described above in (14) and (15).
In this paper a generic set of processing elements is proposed to implement of analysis and synthesis banks on the lifting structures using variable arithmetic format (see table 2). In this table
For an explanation of the practical aspects related to the use of the proposed arithmetic, below an example of the twochannel analysis bank realization on the mother wavelet function db4 [23] is considered. In result of polyphase matrix factorization for a given wavelet basis in the MATLAB environment the following expression was obtained:
The coefficients
In the figure 10 shown a block diagram of the implementation of twochannel filter bank analysis for this example. In this scheme, apart from computing elements
In the figure 11a in more detail the first step realization of the analysis bank is considered and in the figure 11b the last step realization of synthesis bank is shown.
As can be seen from the figure 10 and figure 11, the computing units of analysis and synthesis procedures in terms of implementation differ only in the signs of constant multiplying coefficients and the arithmetic shifts positions and directions.
Based on the materials described above, a concrete realization of twochannel bank can be represented as a vector of parameters containing a set of multiplier constants, shift parameters and some additional information regarding the delay elements in the intermediate nodes of the algorithm.
4.3. Accuracy analysis of the algorithm for fixedpoint variable format
To analyse of the proposed approach in MATLAB function library was written, which simulates the process of fixedpoint calculating in filter bank with specified structure of the tree. In the figure 12 is shown the estimation error variance signal recovery, depending on the choice of bit internal registers as a result of passing through the twochannel filter bank analysis / synthesis (in the example used wavelet filters db8). This figure also shows the results of an experiment using FIR filters, the underlying of the algorithm DWPT. It can be noted that FIR filter implementation gives better results while using the same registers word length, but requires twice as many calculations. So in order to achieve the level of error variance in the 70 dB for based on lifting structures implementation requires 16bit, which are approximately two bits more than the realization based on FIR. But this drawback is compensated by a significant reduction of arithmetic operations compared to the direct implementation. Thus, we conclude that the proposed approach is more efficient in hardware implementation compared with the bank on the basis of FIR filters.
Below we consider another experiment for demonstration of the energy localization properties by using our fixedpoint DWPT algorithm realization. For this a polyharmonic signal was generated and passed through a fivelevel decomposition tree fast wavelet transform (division of the tree is carried out only in the lowfrequency components). As an example, the wavelet functions db2 family was chosen. Thus all range of amplitudes of the wavelet coefficients was divided by 40 thresholds (these values are plotted on the Xaxis of figure 13). Each threshold has been mapped to a vector of the obtained analysis wavelet coefficients on condition that these coefficients are greater than this threshold. Otherwise, the values of the coefficients were replaced by zeros (i.e. in each vector were discarded unimportant relative to a given threshold values). For all vectors was performed reconstruction procedure by synthesis filter bank. In figure 13a, figure 13b for the floatingand fixedpoint implementations, respectively, are shown: solid line – the reconstructed to original signal energy relation (in percentage) depending on the threshold values; dotted line – the percentage of "discarded" wavelet coefficients depending on the chosen threshold.
Based on these results, we note almost complete compliance of floatingpoint model with the proposed fixedpoint approach. Thus, the fixedpoint variable format algorithm DWPT implementation preserves the energy localization inherent to the wavelet packets.
5. DWPT pipeline processor with dynamic reconfigurable architecture
5.1. DAT based reconfigurable signal processing system
The structure of reconfigurable DSP system for signal analysis based on DAT approach consists of the specific microprocessor oriented on the signal processing (DSP microprocessor) and DWPT processor itself with the reconfigurable architecture. The DSP microprocessor perform several task, such as: processing wavelet coefficients
The pipeline architecture is applied for effective implementation of the DWPT algorithm. We suggest the pipeline architecture for constructing the DWPT lifting based processor. This architecture integrates the sequential connection of the homogeneous block (buffer/switch unit (BSU) and processing unit (PU)) that implement a twochannel filter bank that allows parallel calculating DWPT with an arbitrary tree structure. The maximum number of decomposition level that can be realize is 8, it is associate with the depth of
5.2. DWPT lifting based pipeline Processing Unit (PU)
The block diagram of PU is shown in the figure 15. The input sequence
5.3. Buffer/Switch Unit (BSU)
The BSU realizes double buffering scheme known as “pingpong” for providing parallel access to the data for storing results and getting source data from/for PU. The additional channel is for outputting the result data. The two output streams of samples
The momory amount
where
In the figure 17 is shown an example of the tree decomposition, given by the set of nodes {(0,0), (1,0), (1,1), (2,0), (2,1), (2,2), (2.3), (3,0), (3,1)}. It also schematically illustrates the principle of distribution of blocks of memory for the structure of the tree.
5.4. Rapid prototyping algorithm of pipeline DWPT processor
The prototype of the DWPT processor can be specified as parameters describing the structure of twochannel filter bank and the vector that defines the limit tree decomposition.
The method of rapid prototyping can be described by the following sequence of actions.
Calculating the lifting structure of the dual filter bank based on the original wavelet basis functions.
Translating the mathematical model for fixedpoint arithmetic with the requirements of accuracy, and limitation of hardware resources (registers and bit computing units).
Forming a parameters vector for configure a DWPT processor prototype.
Estimating the cost of hardware prototype implementation.
Estimating computation characteristics of the DWPT procressor prototype.
Generating the output files of the synthesized VHDLdescription of the DWPT processor.
5.5. FPGA based hardware implementation of the pipeline DWPT processor
For estimation of performance and resource utilization the present architecture has been implemented on Xilinx FPGA XC3s2000. The realized pipeline DWPT processor has following features. The number of decomposition levels is limited by eight. The mother wavelet function Db8 (16 taps) transformed into nine lifting steps is used. The input and output data has the 16 bits word length, the capacity of internal computing is 18 bits. The present implementation doesn’t have FIFO stages in PU that allows minimizing hardware resources. The processed frame size can be selected in a range from 128 to 1024 samples. Each BSU contains the pair of two 1024×16 bits block RAMs that is used for realizing double buffering scheme. The PU hardware resources utilization are shown in a table 5 and complete processor implementation resources are presented in table 6.
Resource type  Utilized, pcs.  Percentage wise, % 
4 input lookup tables  31356  76 
flipflops  3037  7 
RAMB16  16  40 
MULT18x18s  40  100 
In the figure 18 the protopyte board of dynamic reconfigurable pipeline DWPT processor is shown.
The implemented design performance is 8 MSPS. So, if the sample rate of input audio signal is 44100 Hz then the time cost for computation of wavelet coefficients is 0.6% from all time resource. For example, the 512 samples frame size (~11.6 ms) processing take approximately 0.064 ms on presented DWPT lifting based pipeline processor. The rest time is distributed between the dynamic DWPT tree decomposition algorithms, wavelet coefficients postprocessing and transfer operation.
5.6. DAT based dynamic reconfigurable architecture algorithm
Suppose that for some audio input frame is the space of trees structures
where
Parameters
In turn, a group of parameters
Thus, the transition of signal processing according to the DWPT tree structure
From basic principles of psychoacoustics follows that human perception of acoustic information is quite inert, from 5 ms to 300 ms. Masking forward and backward is approximately 20 ms. With a input audio signal frame length of 5 ms. and processing delay determined by a single stage parallel pipeline processor, we can assume that the delay in processing the input signal
The profile of the time parameters
Thus, the DWPT trees structure
for 1^{st} frame:
for 2^{nd} frame:
for 3^{rd} frame
This algorithm of dynamic reconfiguration allows obtaining a suboptimal solution for DWPT analysis. The advantages of the above algorithm can be summarized as: pruning method is a topdown method, DWPT pruning can be viewed as a split process, i.e. we have the temporal construction DWPT tree for each signal frame that is ideal decision for real time processing implemented in a reconfigurable hardware.
The processing of the first nine frames in the pipeline DWPT processor is shown in the figure 22 where
The complete input signal analysis in DWPT processor is demonstrated in the figure 23. The input signal is segmented on a frame with minimal overlapping and analysed. The frame length in a time is equal 22.3 ms. At the same time for frames to be processed under the current structure of the tree
The output signal synthesis in DWPT processor is demonstrated in the Figure 27. The monitoring system loads the input frame
6. Conclusion
In the given paper the dynamic reconfigurable lifting based adaptive DWPT processor was presented. The lifting scheme allows to reduce on halve the number of multiplications and summations and increase the processing speed. Appling DATbased approach as the design techniques for timevarying DWPT decomposition allows us to construct dynamically adapted to input signal DWPT analysis. The reconfigurable system offers several advantages over competing alternatives: faster and smaller than general purpose hardware solutions; lower development cost than dedicated hardware solutions; dynamic reconfigurable supports multiple algorithms within a single application; multipurpose architecture generates volume demand for a single hardware design. The proposed techniques optimize system performance and, in addition, provide a convenient framework within which ongoing research in the areas of nonuniform filter bank applied to speech/audio coding algorithms and reconfigurable architectures can be synergistically combined to enable the design of reconfigurable highperformance DSP systems.
Thus, the proposed dynamic reconfigurable DWPT processor with framebased psychoacoustic optimized timefrequency tilling is successfully applicable for several application such as monophonic fullduplex audio coding system [18] and scalable audio coding based on hybrid signal decomposition where the transient part of the signal is modelled on psychoacoustic motivated frame based adaptive DWPT in marching pursuit algorithm [24]. The advantages of this DWPT processor is better viewed by considering the DWPT growing as a splitting process, i.e. the temporal construction DWPT tree created for each signal frame presents an ideal decision for real time processing implemented in a reconfigurable hardware.