Many-Core Algorithm of the Embedded Zerotree Wavelet Encoder

In the literature, image compression has been implemented using a variety of algorithms, such as vector quantization, subband coding and transform-based schemes. The current problem is that the selection of an image compression algorithm depends on the required compression ratio, while the quality of the reconstructed images depends on the technique used. Several papers on wavelet transform-based coding present this field as an emerging option for image compression with high coding efficiency, and the wavelet-based image compression scheme JPEG-2000 has been standardized. This chapter presents a novel algorithm, executed in parallel, based on the embedded zerotree wavelet (EZW) coding scheme, in which the programs integrate parallelism techniques to be implemented and executed on the many-core system Epiphany III.


Introduction
The wavelet transform provides a compact multi-resolution representation of an image and uses its energy compaction property to exploit redundancy and achieve compression [1].
In the literature, the discrete wavelet transform (DWT) is usually implemented with a two-channel wavelet filter bank applied recursively [2].
The two-dimensional DWT of an image is usually calculated using a separable approach: the input image is scanned in the horizontal direction and passed through the low-pass and high-pass decomposition filters [3]. The filtered data are then subsampled to separate the low-frequency and high-frequency data in the horizontal direction. The resulting output is scanned vertically, and the filters are applied again to generate the different frequency subbands [4]. After the subsampling stage, the transformation generates four subbands, LL, LH, HL and HH, each of which is 25% of the original image size [5][6][7][8].
The energy is concentrated in the low-frequency LL subband, which represents a low-resolution version of the original image. The high-frequency subbands contain detail information in three directions (horizontal, vertical and diagonal).
The image is then decomposed further by applying the 2-D DWT again to the LL subband [9]. This iterative process generates multiple transformation levels in which the energy is compacted into a few low-frequency coefficients [10]. Figure 1(a) and (b) show an example of a three-level decomposition of an image using the wavelet transform, together with the parent/child relationship between levels.
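The separable decomposition described above can be sketched in C as follows. This is a minimal illustration, not the filter bank used in the chapter: it assumes the Haar wavelet (pairwise average and difference), a square image whose side is a power of two, and row-major storage. The function names haar_dwt2d_level and haar_dwt2d are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One level of a separable 2-D Haar DWT on the top-left size x size block
   of a row-major image with the given row stride. tmp is caller-provided
   scratch with the same layout as img. After the call, the top-left
   quadrant of the block holds LL and the other three quadrants hold the
   detail subbands. */
void haar_dwt2d_level(float *img, float *tmp, size_t size, size_t stride)
{
    assert(size >= 2 && size % 2 == 0 && size <= stride);
    size_t half = size / 2;

    /* Horizontal pass: filter each row and subsample by two. */
    for (size_t r = 0; r < size; r++) {
        for (size_t c = 0; c < half; c++) {
            float a = img[r * stride + 2 * c];
            float b = img[r * stride + 2 * c + 1];
            tmp[r * stride + c] = (a + b) / 2.0f;        /* low-pass  */
            tmp[r * stride + half + c] = (a - b) / 2.0f; /* high-pass */
        }
    }
    /* Vertical pass: filter each column of the intermediate result. */
    for (size_t c = 0; c < size; c++) {
        for (size_t r = 0; r < half; r++) {
            float a = tmp[(2 * r) * stride + c];
            float b = tmp[(2 * r + 1) * stride + c];
            img[r * stride + c] = (a + b) / 2.0f;
            img[(half + r) * stride + c] = (a - b) / 2.0f;
        }
    }
}

/* Multi-level decomposition: each level is applied to the LL subband
   produced by the previous level. n is the image side. */
void haar_dwt2d(float *img, float *tmp, size_t n, unsigned levels)
{
    size_t size = n;
    for (unsigned l = 0; l < levels && size >= 2; l++) {
        haar_dwt2d_level(img, tmp, size, n);
        size /= 2;
    }
}
```

For a 128 × 128 image, calling haar_dwt2d with levels set to 3 would give the three-level structure described above, with the coarsest LL subband occupying the top-left 16 × 16 block.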

Many-core technology
Epiphany III is a low-cost embedded system consisting of a main memory and 16 cores arranged in a mesh, as shown in Figure 2. The system is characterized by low energy consumption, a high level of parallelism and concurrency, and high computing power. In combination, these features allow data to be processed at different levels of software and hardware, with operations performed on each core [11,12].
A multi-core system needs a shared memory space, which for this system consists of 2^32 bytes. Memory addresses are treated as unsigned numbers from 0 to 2^32 − 1; together they represent 2^30 32-bit words, which any core can access concurrently through a 2D mesh topology [13]. This interconnection topology avoids the overloading or blocking of shared memory access that is reported in the literature as a factor limiting the performance of such systems [14][15][16][17].

Figure 2.
Components of the Epiphany III architecture [11].
This system has an energy consumption of 1 W per 50 gigaflops in single-precision calculations, and the cores are manufactured in a 65 nm process. The cores use a RISC architecture and can run at a speed of 1 GHz, and all 16 cores have the same architecture [11,14].
With the memory capacity of the Epiphany III system, it is possible to process an image of 128 × 128 pixels, which is equivalent to 16 kB of memory at 8 bits per pixel.

Embedded Zerotree coding
A wavelet is a waveform of limited duration that has an average value close to zero. A wavelet transform maps a signal from the time domain to the time-scale domain. For an image, the wavelet coefficients are two-dimensional, and because of the subsampling performed in the transformation, the image can be represented using trees.
Whereas Fourier analysis divides a signal into sine waves of various frequencies, wavelet analysis breaks a signal into shifted and scaled versions of the original (mother) wavelet.
In zerotree coding, each wavelet coefficient at a given scale can be related to a set of coefficients at the next finer scale.
A zerotree root (ZTR) is an insignificant coefficient at a coarse scale for which all related coefficients at finer scales are also insignificant. By specifying a ZTR, the encoder can track and zero out all related coefficients at the finer scales.
The EZW encoder is a particular type of encoder used to encode image or sound signals of any dimension; it offers the advantages found in denoising algorithms.
After the wavelet transform is applied to an input image, embedded zerotree wavelet (EZW) encoding quantizes the coefficients using a binary encoding to create a representation of the image [18]. EZW uses the direct relationship between upper-level and lower-level coefficients (parents and children) to obtain maximum coding efficiency [19,20].
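The parent/child relationship and the zerotree root test can be sketched as follows. This is an illustrative sketch under stated assumptions: the coefficients are integers stored row-major in the usual n × n pyramid layout, the children of the coefficient at (r, c) are the four coefficients at (2r, 2c), (2r, 2c+1), (2r+1, 2c) and (2r+1, 2c+1), and the coefficient at (0, 0) is handled separately by the caller. The function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if every descendant of the coefficient at (r, c) has a
   magnitude below the threshold T, and 0 otherwise. */
int descendants_insignificant(const int *coef, int n, int r, int c, int T)
{
    assert(r > 0 || c > 0);        /* the doubling rule excludes (0, 0) */
    if (2 * r >= n || 2 * c >= n)  /* finest scale: no children */
        return 1;
    for (int dr = 0; dr < 2; dr++) {
        for (int dc = 0; dc < 2; dc++) {
            int cr = 2 * r + dr;
            int cc = 2 * c + dc;
            if (abs(coef[cr * n + cc]) >= T)
                return 0;
            if (!descendants_insignificant(coef, n, cr, cc, T))
                return 0;
        }
    }
    return 1;
}

/* A zerotree root is insignificant itself and has only insignificant
   descendants. */
int is_zerotree_root(const int *coef, int n, int r, int c, int T)
{
    return abs(coef[r * n + c]) < T &&
           descendants_insignificant(coef, n, r, c, T);
}
```

When a coefficient passes this test, a single symbol is enough to tell the decoder that the whole subtree is insignificant at the current threshold, which is where the coding gain comes from.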
EZW encoding is performed in the following steps:
STEP 1: Determine the initial threshold using bit plane coding. In each subsequent iteration, the threshold (Ti) is reduced by half, and only the coefficients smaller than 2Ti are encoded in each pass.
EZW involves two passes, applied as a recursive process:
• Dominant pass.
• Subordinate pass.
In the dominant pass, the magnitude of each wavelet coefficient is compared with a threshold value to determine which data are significant, with the coefficients considered in absolute value. The scanning follows the order of the spatial frequencies, and two bits are used to define the sign and the position of the significant coefficients. A coefficient is positive significant or negative significant when its magnitude is above the threshold, which usually starts at the highest power of two below the maximum wavelet value; an isolated zero is a wavelet coefficient that is insignificant but has a significant descendant.
b. Subordinate pass: contains the magnitudes of the significant coefficients for each threshold, and involves the use of two passes: 1. Key stage.
STEP 3: Dominant pass (significance pass): the wavelet coefficients in the dominant list are compared with Ti to determine their significance and, if significant, their sign. The resulting significance map is zerotree coded and sent. The inclusion of the ZTR symbol increases the coding efficiency because the encoder exploits the correlation between image scales.
STEP 4: Four symbols are used to form the code:
1. Zerotree root.
2. Isolated zero.
3. Significant positive.
4. Significant negative.
The EZW technique can be significantly improved by adding entropy coding to achieve better compression [22].
STEP 5: Define the entropy code. The subordinate pass follows, in which the significant coefficients are detected and refined under the successive approximation quantization (SAQ) approach [21].
STEP 6: Define the refinement pass.
STEP 7: The entropy-coded sequence of 1s and 0s is defined using adaptive arithmetic coding (AC), and a STOP symbol is sent.
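The threshold selection and the four-symbol decision of the dominant pass can be illustrated with the following sketch. It assumes integer coefficients and takes the initial threshold as the largest power of two that does not exceed the maximum coefficient magnitude, which is the usual convention in EZW descriptions. The enum and function names are hypothetical, and the flag has_significant_descendant is assumed to be computed elsewhere by walking the coefficient tree.

```c
#include <assert.h>
#include <stdlib.h>

/* The four dominant-pass symbols. */
enum ezw_symbol { EZW_POS, EZW_NEG, EZW_IZ, EZW_ZTR };

/* Largest power of two not exceeding max |coef|; 0 if all are zero.
   Each later pass would use half of the previous threshold. */
int ezw_initial_threshold(const int *coef, int count)
{
    assert(count >= 0);
    int max = 0;
    for (int i = 0; i < count; i++) {
        int m = abs(coef[i]);
        if (m > max)
            max = m;
    }
    if (max == 0)
        return 0;
    int T = 1;
    while (T <= max / 2)
        T *= 2;
    return T;
}

/* Classify one coefficient against the current threshold T. */
enum ezw_symbol ezw_classify(int value, int T, int has_significant_descendant)
{
    if (abs(value) >= T)
        return value >= 0 ? EZW_POS : EZW_NEG;
    return has_significant_descendant ? EZW_IZ : EZW_ZTR;
}
```

With two bits per symbol, the four cases map directly onto the two-bit code mentioned above; an entropy coder can then shorten the frequent ZTR symbol further.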

Proposed algorithm
The EZW (embedded zerotree wavelet) image coding scheme has attracted considerable attention among researchers [23][24][25]. It is one of the most popular wavelet-based compression algorithms and is widely used in image-based applications. In this work, the recursive transformation method is used for multi-level decomposition, and the resulting data are preprocessed before the zerotree compression. The block diagram of the wavelet-based image coding algorithm is shown in Figure 3.
The objective is to present the proposed algorithm, which enhances the compression of an image with minimal loss during reconstruction.
The algorithm is applied as a preprocessing stage, which eliminates data in the transformed image that are not significant for the reconstruction of the image.
However, it should be mentioned that more bits are required during compression and the processing time increases, so a parallel implementation can help.
By exploiting the tradeoff between compression ratio and reconstructed image quality, it is possible to eliminate irrelevant data in the image, achieving higher compression with a slight reduction in quality [21].
In a wavelet-transformed image, the coefficients in the LL subband represent a low-resolution version of the image, while the high-frequency subbands contain the detail data in each direction. The three detail subbands contribute less to the image reconstruction process, because most of their coefficients are close to zero and the few large values correspond to edge and texture information in the image [22].
The proposed algorithm reduces the less essential data in order to achieve higher compression: the high-valued coefficients are preserved, and the lowest values of the minimum-weight subband are eliminated.
The proposed algorithm uses a weight calculation to find the minimum-weight subband at each level. The weight of each subband is calculated by adding the absolute values of all the coefficients in the subband. Only the three detail subbands (LHi, HLi and HHi) are considered, and the one with the minimum weight is selected.

$$W_s = \sum_{c \,\in\, s} |c|, \qquad \min(\text{subband}_i) = \text{subband with minimum weight at level } i \tag{1}$$
After finding the required subband at each level, the algorithm reduces the negligible data in these subbands according to how important the data are for preserving the reconstruction. At this stage, most of the values are close to zero, and the smallest coefficient values in each subband are used to set a threshold for eliminating low-valued data in that subband.
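The weight calculation and the selection of the minimum-weight subband can be sketched as follows. The sketch assumes a row-major n × n pyramid layout in which the three detail subbands of level i are the top-right, bottom-left and bottom-right blocks of side n / 2^i; which of these is labeled LH or HL depends on the naming convention, so the function simply returns a block index. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Weight of a subband: sum of the absolute values of the coefficients in
   the size x size block whose top-left corner is (r0, c0). */
double subband_weight(const float *coef, size_t n,
                      size_t r0, size_t c0, size_t size)
{
    assert(r0 + size <= n && c0 + size <= n);
    double w = 0.0;
    for (size_t r = r0; r < r0 + size; r++) {
        for (size_t c = c0; c < c0 + size; c++) {
            float v = coef[r * n + c];
            w += (v < 0.0f) ? -v : v;
        }
    }
    return w;
}

/* Index of the detail subband with minimum weight at a given level
   (1 = finest): 0 = top-right, 1 = bottom-left, 2 = bottom-right. */
int min_weight_subband(const float *coef, size_t n, unsigned level)
{
    size_t s = n >> level;
    assert(level >= 1 && s >= 1);
    const size_t r0[3] = {0, s, s};
    const size_t c0[3] = {s, 0, s};
    int best = 0;
    double best_w = subband_weight(coef, n, r0[0], c0[0], s);
    for (int k = 1; k < 3; k++) {
        double w = subband_weight(coef, n, r0[k], c0[k], s);
        if (w < best_w) {
            best_w = w;
            best = k;
        }
    }
    return best;
}
```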
The coefficients whose value is higher than the set threshold are retained, and those whose value is near zero are deleted. In the experiments, two different threshold values were used to show the effect on the compressed output and on the reconstructed image.
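The thresholding step can be sketched as below. The sketch assumes that "deleted" means set to zero and that the comparison is made on the magnitude of each coefficient; both are assumptions, as is the hypothetical function name threshold_subband. The threshold value itself is supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Zero every coefficient in the size x size block at (r0, c0) whose
   magnitude does not exceed thr; larger coefficients are kept unchanged.
   Returns the number of nonzero coefficients that were set to zero. */
size_t threshold_subband(float *coef, size_t n,
                         size_t r0, size_t c0, size_t size, float thr)
{
    assert(thr >= 0.0f && r0 + size <= n && c0 + size <= n);
    size_t removed = 0;
    for (size_t r = r0; r < r0 + size; r++) {
        for (size_t c = c0; c < c0 + size; c++) {
            float v = coef[r * n + c];
            float mag = (v < 0.0f) ? -v : v;
            if (mag <= thr && mag != 0.0f) {
                coef[r * n + c] = 0.0f;
                removed++;
            }
        }
    }
    return removed;
}
```

Zeroing small coefficients in the minimum-weight subband lengthens the runs of insignificant coefficients, so more subtrees can be coded as zerotree roots in the later EZW passes.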
In zerotree coding, the reduction of low-valued significant coefficients in the minimum-weight subbands results in a higher compression ratio with a slight loss in decoded PSNR. The results show that the algorithm achieves better compression efficiency at the cost of a negligible loss in picture quality.

Parallel implementation
The Epiphany III system provides a memory model and a programming model that support several execution flows, including single instruction multiple data (SIMD), single program multiple data (SPMD), master-slave programming and multiple instruction multiple data (MIMD).
First, the operating system checks the hardware of the Epiphany system and configures the cores in the mesh topology. It then distributes the data of the matrices A and B, as required by the structure of the Epiphany system, and finally distributes the tasks in suitably small blocks (Figure 4). For this distribution, the single program multiple data (SPMD) programming model is used [27], which is responsible for distributing the execution among the cores, as shown in Figure 5.

Distributions of data and tasks
Depending on the problem, one of two approaches to decomposition can be chosen: data decomposition or task decomposition.
The strategies considered here are the decomposition of instructions and the decomposition of tasks.
Decomposition of instructions: this refers to distributing among the cores the instructions that can be executed in parallel, with each core handling the processing of its share.
The results of the individual cores are then combined. With n cores, this kind of parallel execution runs up to n times faster than execution on a single core, except for the small delay involved in the initial distribution of the workload or instructions and the final collection of the results; the acceleration is therefore approximately linear in the number of cores.
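As an illustration of how a workload can be split evenly among the cores, the following sketch assigns each core a contiguous range of image rows. It is a generic block partition and not the distribution code of the chapter; the structure row_range and the function core_rows are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of rows assigned to one core. */
struct row_range {
    size_t begin;
    size_t end;
};

/* Split n_rows as evenly as possible among n_cores. The first
   (n_rows % n_cores) cores receive one extra row each. */
struct row_range core_rows(size_t n_rows, size_t n_cores, size_t core_id)
{
    assert(n_cores > 0 && core_id < n_cores);
    size_t base = n_rows / n_cores;
    size_t extra = n_rows % n_cores;
    size_t begin = core_id * base + (core_id < extra ? core_id : extra);
    size_t len = base + (core_id < extra ? 1 : 0);
    return (struct row_range){begin, begin + len};
}
```

For a 128-row image on 16 cores, every core receives 8 rows, so the per-core work is balanced and the only serial cost is the initial distribution and the final collection of results.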
Decomposition of tasks: this refers to distributing tasks among the cores, each of which is responsible for processing its tasks; that is, the whole program is divided into tasks. A task is a sequence of the program that can be executed in parallel, concurrently with other tasks. This approach is beneficial when the tasks are highly independent. If the tasks or algorithms differ from each other, a functional decomposition can be implemented, in which the particular characteristics of each type of task or algorithm are used to assign it to the core intended for that purpose.
Recent embedded systems integrate circuits with several cores, in which the cores run threads, that is, tasks executed in parallel, so that applications finish in less time. This increases parallelism and concurrency in both hardware and software. It is also important to distribute the tasks among the cores and to synchronize the cores.
Matrix multiplication is a mathematical operation widely used in scientific computing. The algorithm used here for matrix multiplication in parallel is the one described by IBM [28]. Data communication between neighboring cores, following the Cannon algorithm [29], is also important. The memory in each core is limited, which is a challenge for the implementation and makes it necessary to use the available memory space for communication between the cores. These are important factors for a system dedicated to parallel processing. Figure 6 shows the matrix multiplication performed on each core and the sending of data between cores for the execution of the tasks; the Epiphany III system incorporates a message-passing mechanism to synchronize this exchange, avoiding conflicts between the cores and in accesses to the shared global memory. The data flow in Figure 6 runs in √P steps, where P is the number of cores, and the matrix data are partitioned into √P × √P blocks.
At each step, a contribution to the elements of the product matrix C is computed; then the data of matrix A move down and the data of matrix B move to the right. This example can be programmed in standard ANSI C. The Epiphany III system provides specific functions that simplify many-core programming, thanks to the open operating system, but their use is not mandatory for programmers.
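The step structure of Cannon's algorithm can be illustrated with a sequential simulation. The sketch below is not the Epiphany implementation: it runs on a single processor, assumes a 4 × 4 grid of simulated cores that each hold one matrix element instead of a block, and uses the common textbook formulation in which, after an initial alignment, the A data shift one position left and the B data shift one position up at each step. The shift directions depend on how the blocks are laid out, so they can differ from the ones in Figure 6. The name cannon_multiply is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define Q 4 /* grid side: Q * Q = 16 simulated cores */

/* Computes C = A * B in Q steps. Simulated core (i, j) holds one element
   of A and one of B at a time and accumulates C[i][j] locally. */
void cannon_multiply(float A[Q][Q], float B[Q][Q], float C[Q][Q])
{
    float a[Q][Q], b[Q][Q], na[Q][Q], nb[Q][Q];
    assert(A != C && B != C);

    /* Initial alignment: row i of A is rotated left by i positions and
       column j of B is rotated up by j positions. */
    for (int i = 0; i < Q; i++) {
        for (int j = 0; j < Q; j++) {
            a[i][j] = A[i][(j + i) % Q];
            b[i][j] = B[(i + j) % Q][j];
            C[i][j] = 0.0f;
        }
    }

    for (int step = 0; step < Q; step++) {
        /* Local multiply-accumulate on every core. */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++)
                C[i][j] += a[i][j] * b[i][j];

        /* Neighbor exchange: A moves one core left, B one core up. */
        for (int i = 0; i < Q; i++) {
            for (int j = 0; j < Q; j++) {
                na[i][j] = a[i][(j + 1) % Q];
                nb[i][j] = b[(i + 1) % Q][j];
            }
        }
        memcpy(a, na, sizeof a);
        memcpy(b, nb, sizeof b);
    }
}
```

Each core only ever exchanges data with its nearest neighbors on the mesh, which is why this scheme fits a 2D mesh with a small local memory per core.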
The implementation of the algorithm on the Epiphany III system, with 16 cores operating at 1 GHz, solves a 128 × 128 matrix multiplication in 2 ms. With appropriate programming and data distribution models, the performance of the Epiphany III system grows linearly, which is reflected in the cost/performance of an optimized system [11][30][31][32][33].
To exploit parallelism on the Epiphany hardware platform, the application should be broken down into tasks (code portions), so that each core can run a part of the program in parallel with the other cores. The decomposition into tasks must be followed by a synchronization of the parties involved to ensure data consistency.
Figure 6 represents the matrix multiplication schematically and shows how it is simplified. The multiplication is given by (2):
$$C_{ij} = \sum_{k} A_{ik} B_{kj} \tag{2}$$
The input matrices are A and B, and C is the resulting matrix; each element of C with coordinates (i, j), that is, (row, column), is computed from row i of A and column j of B. The procedure to program matrix multiplication on a single core is shown below.
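The original listing is not reproduced in this text, so the following is only a minimal sketch of a single-core matrix multiplication in standard C, assuming square n × n matrices stored row-major in flat arrays. The function name matmul is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Single-core multiplication C = A * B for n x n row-major matrices.
   C must not overlap A or B. */
void matmul(const float *A, const float *B, float *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {         /* row of C */
        for (size_t j = 0; j < n; j++) {     /* column of C */
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)   /* dot product */
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

For n = 128, three such matrices of 8-bit data would need 48 kB and of 32-bit floats 192 kB, which is one reason the data have to be split into blocks before they fit in the limited local memory of each core.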
The previous code is written in standard C and compiles and executes on a single core.
When the matrices A and B are distributed to run on all the cores in the system, the resulting elements of the matrix C are placed in the local memory of each core.

Experiment results
Different output images of the wavelet-based compression are shown in Figure 7. Table 1 shows the values obtained when the proposed algorithm was applied, and Table 2 shows the values obtained using adaptive arithmetic coding.

Conclusion
The proposed method exploits the tradeoff between compression ratio and output PSNR, and reduces the least essential data in order to attain further compression. After the threshold is applied, a better compression ratio is achieved than with the original EZW coder, with a slight reduction in PSNR during reconstruction.