Abstract
This chapter describes the field programmable gate array (FPGA) implementation of a turbo decoder for the 3GPP long-term evolution (LTE) standard and for IEEE 802.16-based WiMAX systems. We first present the serial decoding architectures for the two systems. The same approach is used for both, although the WiMAX scheme implements a duo-binary code, while the LTE scheme uses a binary code. The proposed LTE serial decoding scheme is adapted for parallel transformation. Then, considering the high throughput requirements of LTE, a parallel decoding solution is proposed. With a parallelization factor of N = 2^p levels, the parallel approach reduces the decoding latency N times compared to the serial one. The parallel approach causes a small degradation of the decoding performance, but we propose a solution that almost eliminates this degradation by performing an overlapped data block split. Moreover, exploiting the native properties of the LTE quadratic permutation polynomial (QPP) interleaver, we propose a simplified parallel decoder architecture. The novelty of this scheme is that only one interleaver module is used, regardless of the value of N, by introducing an even-odd merge sorting network, for which we propose a recursive implementation that uses only comparators and subtractors.
Keywords
- LTE
- WiMAX
- turbo decoder
- single interleaver
- Max LOG MAP
- parallel architecture
- FPGA
1. Introduction
Channel coding theory was intensively studied during the last decades, but interest in this topic increased even more following the pioneering work of Berrou et al. on turbo codes [1–3].
From their early existence, turbo codes proved to achieve excellent decoding performance, so they were adopted as recommendations in many standards. They became an even more appealing solution once the processing capacity of field programmable gate arrays (FPGA) and digital signal processors (DSP) increased. Their implementation complexity was no longer prohibitive, which allowed them to become mandatory.
In this context, the Third-Generation Partnership Project (3GPP) organization adopted these novel coding techniques early. Turbo codes were introduced into the standard by the first version of the Universal Mobile Telecommunications System (UMTS) technology (in 1999). The subsequent UMTS releases (the following high-speed packet access) contributed new and interesting features, while turbo coding remained unchanged. Several modifications were then introduced by the long-term evolution (LTE) standard; even if they were not significant in volume, they were important in terms of concept. In this framework, 3GPP proposed a new interleaver scheme for LTE, while maintaining exactly the same coding structure as in UMTS. Turbo codes were also introduced by the Institute of Electrical and Electronics Engineers (IEEE) in the 802.16 standards, known as the base for WiMAX systems.
In Ref. [4], a binary turbo decoding scheme dedicated to UMTS is developed, whereas a similar duo-binary architecture for WiMAX systems is presented in Refs. [5] and [6]. Thanks to the new LTE/LTE-advanced (LTE-A) interleaver, the decoding performance is improved compared to that of the UMTS standard. In addition, the new LTE interleaver has native properties suited for a parallel decoding approach inside the algorithm, thus taking advantage of the main idea behind turbo decoders (i.e., exchanging the extrinsic values between the two decoding units). In Ref. [7], a serial decoding scheme implemented on FPGA is presented. However, parallelization is still needed when high throughput is required, as in the particular case of LTE systems using diversity techniques.
In the past years, many interesting parallel decoder schemes have been studied. The obtained results are measured along two directions: first, the decoding performance degradation of the parallel solution with respect to the serial one; second, the hardware resources occupied by the parallel decoder implementation. In Ref. [8], a first group of parallel decoding solutions is presented, based on the classical maximum a posteriori (MAP) algorithm. This method passes through the trellis twice, first to compute the forward state metrics (FSM) and then to obtain the backward state metrics (BSM) and, simultaneously, the log likelihood ratios (LLR). Following this approach, several schemes were developed in order to reduce the theoretical latency of the decoding process of 2
In Refs. [9] and [10], a second set of parallel architectures is described, which takes advantage of the algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. In these works, efficient hardware implementations of the QPP interleaver are proposed. However, the parallelization factor
In Ref. [11], a third approach was reported, which consists in using a folded memory: all the data needed for parallel processing are stored at the same time. On the other hand, the main challenge of this kind of implementation is to correctly distribute the data to each decoding unit, once a memory location containing all
In this chapter, we present optimized implementations of serial architectures for WiMAX and LTE turbo decoding schemes. Then, for LTE systems, we describe a parallel decoding architecture introduced in Refs. [12] and [13], which also relies on a folded memory-based approach. Nevertheless, the main difference compared to the existing solutions presented above is that our proposed approach includes only one interleaver. Additionally, with an even-odd merge sorting unit [14, 15], the parallel architecture maintains the same structure as the serial one, the only difference being that the soft-input soft-output (SISO) decoding unit is included
Finally, we present throughput and speed results obtained when targeting a XC5VFX70T [16] chip on Xilinx ML507 [17] board. Moreover, we provide simulation curves for the three considered cases, i.e., serial decoding, parallel decoding and parallel decoding with overlap.
2. The coding scheme
2.1. WiMAX systems
Section 8.4 of the 802.16 standard [18] presents the coding scheme on which the proposed decoder is based. Figure 1 shows the duo-binary encoder. The native coding rate is 1/3. In order to obtain other coding rates, a puncturing block must be used. Accordingly, a depuncturing block must be added to the receiver architecture.
Let us define the following parameters: coding rate
As mentioned in Ref. [6], the main problem of a convolutional turbo code (CTC) decoder implementation is represented by the amount of required hardware resources. Moreover, in order to reach the targeted high data rate, the system clock has to be fast. Equation (1) presents the decoding throughput.
For a fixed latency algorithm, according to Eq. (1), the output throughput is improved when achieving a higher clock frequency. Another way is to reduce latency using a parallel architecture; however, this increases the occupied area and may lead to a smaller clock frequency due to longer routes. Moreover, another direct constraint is the significant memory needed for storing data. This issue also affects the frequency, since a large number of used memory blocks leads to a large resource spread on chip and, obviously, longer routes.
Taking into account the previously mentioned aspects, we can conclude that all the parameters presented above are interrelated, so a global optimization is not possible. Consequently, we chose to balance each direction in order to meet the throughput requirements.
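Since Eq. (1) is not reproduced in this excerpt, the trade-off can be illustrated with a small calculation, assuming the usual form in which throughput equals the block length divided by the decoding latency; the cycle count used below is purely illustrative:

```python
# Toy throughput estimate for a fixed-latency decoder. Assumption (not from
# the text): R_b = K * f_clk / N_cycles, where N_cycles is the total number
# of clock cycles spent decoding one block of K bits.

def throughput_mbps(block_len_bits, f_clk_mhz, cycles_per_block):
    """Decoded bit rate in Mbps for one data block."""
    latency_us = cycles_per_block / f_clk_mhz   # decoding latency in microseconds
    return block_len_bits / latency_us          # bits per microsecond = Mbps

# Example: a 2400-bit block at 125 MHz, assuming 8 cycles/bit over 4 iterations
print(throughput_mbps(2400, 125.0, 2400 * 8 * 4))
```

Raising the clock frequency or cutting the cycle count (e.g., by parallelization) both increase the ratio, which is exactly the balance discussed above.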
2.2. LTE systems
A classic turbo coding scheme is presented in the 3GPP LTE specification, including two constituent encoders and one interleaver module (Figure 2). The data block
In order to drive back the constituent encoders to the initial state (at the end of the coding process), the switches from Figure 2 are moved from position A to position B. Since the final states of the two constituent encoders are not the same (different input data blocks produce different final state), this switching procedure generates tail bits for each encoder. These tail bits are sent together with the systematic and parity bits, thus resulting the following final sequence:
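As a concrete illustration of the termination procedure, the following minimal software model encodes one LTE constituent (RSC) encoder and generates its tail bits. The polynomials are those of 3GPP TS 36.212 (feedback g0 = 1 + D² + D³, feedforward g1 = 1 + D + D³); the code structure and variable names are our own, not the chapter's:

```python
# Minimal model of one LTE constituent RSC encoder with trellis termination.
# Polynomials per 3GPP TS 36.212: feedback 1 + D^2 + D^3, feedforward 1 + D + D^3.

def rsc_encode(bits):
    s = [0, 0, 0]                        # shift register contents (D, D^2, D^3)
    parity = []
    for b in bits:
        fb = b ^ s[1] ^ s[2]             # feedback bit entering the register
        parity.append(fb ^ s[0] ^ s[2])  # feedforward output: in ^ D ^ D^3
        s = [fb, s[0], s[1]]
    # termination (switch A -> B): feed the inputs that drive the state to zero
    tail_sys, tail_par = [], []
    for _ in range(3):
        b = s[1] ^ s[2]                  # input choice makes the feedback zero
        tail_sys.append(b)
        tail_par.append(s[0] ^ s[2])     # with zero feedback, output is D ^ D^3
        s = [0, s[0], s[1]]
    return parity, tail_sys, tail_par
```

Because the two constituent encoders see different input blocks, their registers end in different states, so each produces its own three systematic/parity tail-bit pairs, as described above.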
As it was previously mentioned and discussed in Ref. [7], the LTE turbo coding scheme introduces a new interleaving structure. Thus, the input sequence is rearranged at the output using:
where the interleaving function π applied over the output index
The input block length
3. The decoding algorithm
3.1. WiMAX systems
The decoding architecture consists of two decoding units called constituent decoders. Each such unit receives systematic bits (in natural order or interleaved) and parity bits, as shown in Figure 1.
The block diagram implements a maximum logarithmic maximum a posteriori (Max-Log-MAP) algorithm. For turbo binary codes, the decoder represents each binary symbol in the log likelihood ratio (LLR) space as a single likelihood ratio. For turbo duo-binary codes, however, the decoding unit requires three likelihood ratios in the same space. If we consider the duo-binary pair
where (
The constituent decoder (Figure 4) performs the corresponding processing forward and backward over the trellis. When moving forward, the decoder computes the unnormalized metric
where the operator “maximum” is executed over all four branches entering the state
where the operator “maximum” and the normalization method are similar to Eq. (6).
The initialization with null values is carried out for all the forward and backward metrics at all states. Once the new values are computed and stored, the decoding unit executes the second step of the decoding procedure, i.e., the LLR computation as in Eq. (4). The decoding unit starts by computing the likelihood ratio for each branch
and continues with the value
where the operator “maximum” is computed over all eight branches generated by the pair (
The decoding procedure is executed for a predefined number of iterations or until a convergence criterion is reached. Then, a final decision is taken on the bits. This is achieved by computing for each bit of the pair (
where
3.2. LTE systems
The decoding architecture for the LTE systems is presented in Figure 5. The two decoding units, called recursive systematic convolutional (RSC) decoders, theoretically use the MAP algorithm. The classical MAP solution ensures the best decoding performance. Unfortunately, it is also characterized by an increased implementation complexity and may involve variables with a large dynamic range. For these reasons, the classical MAP algorithm is used only as a reference for the expected decoding performance. For real implementations, suboptimal algorithms have been studied: Logarithmic MAP (Log MAP) [20], Max Log MAP, Constant Log MAP (Const Log MAP) [21] and Linear Log MAP (Lin Log MAP) [22].
For the LTE systems, we consider a decoding architecture based on the Max Log MAP algorithm. This suboptimal algorithm overcomes the problems of implementation complexity and dynamic range at the price of a lower decoding performance compared with the MAP algorithm. However, this degradation can be kept within accepted limits. Starting from the Jacobi logarithm, only the first term is used by the Max Log MAP algorithm, i.e.,
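The Jacobi logarithm is max*(a, b) = ln(e^a + e^b) = max(a, b) + ln(1 + e^(−|a−b|)); Max Log MAP keeps only the max term. A quick numeric check of the approximation:

```python
import math

# Jacobi logarithm (exact, Log MAP) versus the Max Log MAP approximation.
# The suboptimal variants listed above differ only in how they approximate
# (or tabulate) the correction term ln(1 + e^{-|a-b|}).

def max_star(a, b):
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))  # exact

def max_only(a, b):
    return max(a, b)                                      # Max Log MAP

# The error is bounded by ln(2) (when a == b) and vanishes as |a - b| grows.
print(max_star(2.0, 2.0) - max_only(2.0, 2.0))  # equals ln(2)
print(max_star(9.0, 1.0) - max_only(9.0, 1.0))  # nearly zero
```

The bounded error explains why the performance loss of Max Log MAP can be kept within accepted limits.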
The trellis diagram for the turbo decoding architecture of the LTE systems contains eight states, as presented in Figure 6. Each state of the diagram has two inputs and two outputs. The branch metric between the states
where
Looking at the LTE turbo encoder trellis, one can notice that between two states, there are four possible values for the branch metrics:
The LTE decoding process follows a similar approach as for WiMAX systems, i.e., it moves forward and backward through the trellis.
3.2.1. Backward recursion
The algorithm moves backward over the trellis computing the metrics. The obtained values for each node are stored in a normalized manner; they will be used for the LLR computation once the algorithm starts moving forward through the trellis. We name
where
and then stored in the dedicated memory.
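One backward-recursion step can be sketched as follows. The trellis connectivity and branch-metric tables below are illustrative placeholders, not the exact tables of the text; only the max-plus update and the state-0 normalization follow the description above:

```python
# Sketch of one backward-recursion step on an 8-state trellis (Max Log MAP).
# `successors[s]` lists the two branches leaving state s as (next_state,
# gamma_index) pairs; `gamma` holds the four possible branch-metric values.
# Both tables are illustrative assumptions, not the chapter's exact ones.

def beta_step(beta_next, successors, gamma):
    """beta_k(s) = max over the two branches of beta_{k+1}(s') + gamma."""
    beta = [max(beta_next[s0] + gamma[g0], beta_next[s1] + gamma[g1])
            for (s0, g0), (s1, g1) in successors]
    ref = beta[0]                     # normalization: the state-0 metric is
    return [b - ref for b in beta]    # subtracted, bounding the dynamic range
```

The normalized values are what gets written to the dedicated memory at each trellis stage.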
3.2.2. Forward recursion
When the backward recursion is finished, the algorithm moves forward through the trellis in the normal direction. This phase of the decoding is similar to the Viterbi algorithm. In this case, only the previous-stage metrics must be stored, i.e., for computing the current stage
where
The decoding algorithm can now obtain an LLR estimate for the data bits
The likelihood of a bit being 0 (or 1) is obtained by applying the Jacobi logarithm over all the branch likelihoods corresponding to a 0 (or 1) input bit, and thus:
where “max” operator is recursively computed over the branches, which have at the input a bit of 1
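The final LLR step reduces to grouping the branch likelihoods by the input bit they carry and taking the difference of the two maxima. A minimal sketch (the sign convention and the `branches` structure are illustrative assumptions):

```python
# Max Log MAP LLR decision: group branch likelihoods by their input bit and
# subtract the two maxima. `branches` is a placeholder list of
# (input_bit, branch_likelihood) pairs, one per trellis branch.

def llr(branches):
    best0 = max(m for bit, m in branches if bit == 0)
    best1 = max(m for bit, m in branches if bit == 1)
    return best1 - best0    # > 0 favours a decoded 1, < 0 favours a 0

print(llr([(0, -1.2), (0, 0.4), (1, 2.0), (1, -3.0)]))  # best1 - best0 = 2.0 - 0.4
```

In the full decoder each branch likelihood is the sum of a forward metric, a branch metric and a backward metric, but the max-and-subtract structure is the same.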
4. Proposed serial decoding scheme
4.1. WiMAX systems
One important remark about the decoding algorithm is that the outputs of one constituent decoder represent the inputs of the other. At the same time, since the interleaver and deinterleaver procedures apply over complete data blocks (so the whole block is needed), the two constituent decoders work in a nonoverlapping manner, which allows the usage of a single constituent decoder. This decoding unit operates time-multiplexed and the corresponding proposed scheme is presented in Figure 7.
In Figure 7, we can identify the storage requirements: the memory blocks that store data from one semi-iteration to another and the memory blocks used from one iteration to another. IL stands for the interleaver/deinterleaver procedure, while CONTROL is the management unit controlling the decoder functionalities. This module provides the read and write addresses, the signals that trigger the forward and backward movements through the trellis, the selection of one of the two SISO units and the control of the MUX and DEMUX blocks. An input buffer is also included, since the decoding architecture can accept a new encoded data block while still processing the previous one. The most important module in Figure 7 is the SISO unit, which is the decoding structure. Figure 8 depicts the block scheme of this decoding unit. One can observe the unnormalized metric computing modules BETA (backward) and ALPHA (forward) and the GAMMA module that computes the transition metric. The latter also ensures the normalization: the metrics values obtained for state
It is important to mention that some studies have been conducted regarding the normalization function. In trying to increase the system frequency (in order to reduce the decoding latency and thus increase the decoded data throughput), one may think of removing the normalization, thereby reducing the amount of logic on the critical path. This solution is not applicable because five extra bits would be needed for the metric values, which would mean more memory blocks and more complex arithmetic. In the end, all this would lead to a lower system frequency, so there is no benefit to this approach. On the other hand, we propose a dedicated approach to implement the metric computation blocks (ALPHA, BETA and GAMMA). Based on the trellis structure, we identified the relations for each metric, 32 equations being used for the transition metric computation (recall that each of the eight trellis states has four possible transitions). Moreover, only 16 of them are distinct (the other 16 are identical) and, of these 16, some are null. Using this approach, a complexity decrease is obtained.
Figure 9 depicts the timing diagram for the proposed SISO. This corresponds to the scenario with one SISO unit and some MUX and DEMUX blocks replacing the two SISO units from the theoretical decoding architecture (see Figure 7).
In Figure 9, R/W (
4.2. LTE systems
The same remark about the two SISO units from Figure 5 working in a nonoverlapping manner applies for LTE systems as for WiMAX ones. The same approach is used, i.e., the proposed decoding architecture includes only one SISO unit and some MUX and DEMUX blocks. Figure 10 depicts the block scheme of the proposed decoding architecture.
One can observe the memory blocks in Figure 10. Some are used to store data between two successive semi-iterations or between two successive iterations. Others, in dotted line, are virtual memories used only to clarify the introduced notations. Moreover, the interleaver and deinterleaver modules are drawn separately in the scheme, but in fact they are the same; both include a memory block called ILM (interleaver memory) and an interleaver. The novelty of this approach compared to the previous serial implementation proposed in Ref. [7] is the ILM. This memory allows a fast transition to a parallel decoding architecture. The input data memories (on the left side in Figure 10) and the ILM are switched buffers, allowing new data to be written while the previous block is still being decoded. The ILM is filled with the interleaved addresses at the same time as the new data are stored in the input memories. The saved addresses are then used as read addresses for the interleaver unit and as write addresses for the deinterleaver unit. Here, we detail the way the architecture from Figure 10 works. The vectors
In detail, SISO 1 reads the input memories and starts the decoding process, outputting the computed LLRs. Having the LLRs available and the extrinsic values, the vector V2(
The SISO unit provides at the end of each semi-iteration
The memories
All the memory blocks in Figure 10 have 6144 locations, this being the maximum coded data block length defined by the standard. Only the memory blocks holding the input data for the SISO units have 6144 + 3 locations, because they also store the tail bits. All locations contain 10 bits. Using a finite-precision Matlab simulator, it has been observed that six bits are needed for the integer part, in order to cover the dynamic range of the variables, and three bits are needed for the fractional part, to keep the decoding performance close to the theoretical one with an accepted level of degradation. The 10th bit is the sign bit.
The SISO decoding unit is similar to the one depicted in Figure 8. ALPHA and BETA modules compute the unnormalized forward metrics and the unnormalized backward metrics, respectively. The GAMMA module computes the transition metrics and executes also the normalization (the metrics for state
The L module produces the output log likelihood ratios. These are then normalized inside the NORM module. The MUX-MAX makes the inputs selection (for forward or backward trellis runs) and implements also the maximum operator. The MEM BETA module keeps the backward metrics corresponding values into the memory.
Since the same approach is used for both the WiMAX and LTE proposed serial decoding architectures, the same remarks apply. Thus, for the LTE turbo decoder as well, the normalization function allows a reduced dynamic range for the variables. Trying to eliminate it, in order to reduce the number of logic levels on the critical path, does not lead to a higher system frequency because, again, more memory blocks are required and more complex arithmetic (since the variables are expressed on more bits) is used; as an overall consequence, a lower clock frequency is reported for the design.
For the ALPHA, BETA and GAMMA modules inside the SISO decoding unit, dedicated equations are again used to compute the metrics. Sixteen such relations are implemented for the transition metric computation (eight trellis states with two possible transitions each). In fact, only four equations are distinct (as indicated in Eq. (15)) and one of them is null. This way, the computational effort of the proposed architecture is minimized.
The interleaving and deinterleaving procedures implement the same equation. The interleaved index is computed using a modified form of Eq. (3), i.e.,
For the interleaving process, the data are written into the memory block in natural order and then read in interleaved order, while for the deinterleaving process the data are written in interleaved order and then read in natural order.
The computation in Eq. (22) is executed in three phases. First, the value for
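The per-index computation can be sketched directly from the QPP definition π(x) = (f1·x + f2·x²) mod K, split into the three phases described above. The f1, f2 values below are the 3GPP TS 36.212 entries for K = 40:

```python
# Direct (per-index) QPP interleaver, pi(x) = (f1*x + f2*x^2) mod K, computed
# in three phases so that the partial products stay inside the modulus.
# f1 = 3, f2 = 10 are the TS 36.212 parameters for the K = 40 block length.

def qpp(x, K=40, f1=3, f2=10):
    t1 = (f1 * x) % K                # phase 1: linear term
    t2 = (f2 * ((x * x) % K)) % K    # phase 2: quadratic term, reduced early
    return (t1 + t2) % K             # phase 3: final modular sum
```

By construction the QPP map is a permutation of {0, …, K−1}, which is what makes the write-natural/read-interleaved memory scheme above work.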
5. Proposed parallel decoding scheme
The serial architecture described in Figure 10 for LTE systems can be reorganized in a parallel setup, by instantiating the RSC SISO module
The most important benefit brought by the proposed serial decoding scheme is that the interleaver module is used only once, before the decoding stage. The ILM is updated each time a new data block enters the decoder, while the previous block is still being decoded. This approach prepares a fast and simple transition to the parallel scheme. Considering that the factor
- 768 locations × 80 bits
- 1536 locations × 40 bits
- 3072 locations × 20 bits
- 6144 locations × 10 bits
Only two BRAMs are used, the same as in the case of serial ILM.
Figure 12 shows the ILM working principle. As one can observe, during the writing procedure, each index
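The folded-memory idea of packing several 10-bit values into one wide word can be sketched as follows. The row/lane mapping used here (row = i // N, lane = i % N) is an illustrative assumption and not necessarily the exact ordering of Figure 12:

```python
# Sketch of the folded memory: N ten-bit values needed in the same clock
# cycle share one wide memory word. Mapping assumption (not from Figure 12):
# index i goes to row i // N, lane i % N.

N, WIDTH = 8, 10

def pack(values):
    """Concatenate N values of WIDTH bits each into one memory word."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & ((1 << WIDTH) - 1)) << (lane * WIDTH)
    return word

def unpack(word):
    """Recover the N lane values from one memory word."""
    mask = (1 << WIDTH) - 1
    return [(word >> (lane * WIDTH)) & mask for lane in range(N)]
```

One read then delivers all N operands simultaneously to the N SISO workers, which is why the reordering network described next is needed to route each lane to the correct unit.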
For the reordering module, an even-odd merge sorting network is applied. This method was introduced by Batcher in Ref. [14] and belongs to the family of sorting networks, which includes several approaches. One example is bubble sorting, which repeatedly sorts adjacent pairs of elements. Another is shell sorting, which groups the input data into an array and then repeatedly sorts the array's columns; after each iteration, the array becomes one column smaller. A third example is even-odd transposition sorting, which alternately sorts the odd-indexed elements with the adjacent even-indexed ones and the even-indexed elements with the adjacent odd-indexed ones. The fourth example is bitonic sorting: the two halves of the input data are sorted in opposite directions and then jointly processed to produce one completely sorted sequence.
The even-odd merge sorting method is based on a theorem saying that any list of
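The recursive structure of Batcher's even-odd merge [14] can be sketched in software; each `min`/`max` pair below corresponds to one comparator of the hardware network (power-of-two input length assumed):

```python
# Recursive sketch of Batcher's even-odd merge sort for a power-of-two input.
# Each min/max pair models one compare-exchange element of the network.

def oddeven_merge_sort(a):
    if len(a) <= 1:
        return a
    half = len(a) // 2
    return merge(oddeven_merge_sort(a[:half]), oddeven_merge_sort(a[half:]))

def merge(x, y):
    """Merge two sorted lists of equal power-of-two length."""
    if len(x) == 1:
        return [min(x[0], y[0]), max(x[0], y[0])]   # a single comparator
    evens = merge(x[0::2], y[0::2])                 # merge the even subsequences
    odds = merge(x[1::2], y[1::2])                  # merge the odd subsequences
    out = [evens[0]]
    for e, o in zip(evens[1:], odds[:-1]):          # final compare-exchange stage
        out += [min(e, o), max(e, o)]
    return out + [odds[-1]]
```

Because the comparator positions are fixed and independent of the data, the network maps directly to a pipelined hardware implementation.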
In combination with the presented parallel decoding architecture, we also propose a simplified implementation for the interleaver block. As seen from Eq. (3), the arithmetic requirements for the computation of the memory addresses
By introducing the notation
it can be observed that
where
We can rewrite Eq. (3) using Eqs. (23) and (24)
The multiplications are replaced by additions, which require fewer hardware resources. Nevertheless, the division is still necessary for the
we can decrease the arithmetic effort needed to obtain
Using Eq. (29) in Eq. (26), the result is
All of the numerical values added in the last stage of Eq. (29) are lower than
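The additions-only generation can be sketched as follows. The recursion π(x+1) = (π(x) + g(x)) mod K with g(x) = (f1 + f2 + 2·f2·x) mod K is a standard QPP property; the conditional subtractions stand in for the comparator/subtractor structure described above, though the chapter's exact datapath is not reproduced:

```python
# Recursive QPP generation using only additions and conditional subtractions.
# pi(x+1) = pi(x) + g(x) (mod K), g(x+1) = g(x) + 2*f2 (mod K), pi(0) = 0,
# g(0) = f1 + f2. Each "if ... -= K" models one comparator + subtractor.

def qpp_sequence(K, f1, f2):
    pi, g, step = 0, (f1 + f2) % K, (2 * f2) % K
    out = []
    for _ in range(K):
        out.append(pi)
        pi += g
        if pi >= K:          # conditional subtraction replaces the division
            pi -= K
        g += step
        if g >= K:
            g -= K
    return out
```

Since pi and g both stay below K, one subtraction per update suffices, so no multiplier or divider is ever needed after the initialization.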
6. Implementation results
6.1. WiMAX systems
The estimated system frequency when implementing the decoding structure on a Xilinx XC4VLX80-11FF1148 chip using the Xilinx ISE 11.1 tool is 125 MHz. The reserved chip area is around 3000 (8.37%) slices from a total of 35,840. The results are comparable with the assessments presented in [26].
The decoding latency and decoding rate corresponding to the above-mentioned clock frequency (see Table 1) are
| Fclk [MHz] | Block size N | Latency [μs] | | | Rb [Mbps] | | |
|---|---|---|---|---|---|---|---|
| 125 | 24 | 2.78 | 3.71 | 4.64 | 17.24 | 12.93 | 10.34 |
| 125 | 240 | 23.52 | 31.36 | 39.2 | 20.41 | 15.31 | 12.24 |
| 125 | 2400 | 230.9 | 307.8 | 384.8 | 20.79 | 15.59 | 12.47 |
The implementation delay is 10 clock periods per iteration and is added to the theoretical latency of the MAP algorithm (which is 4
In Figure 16, the decoding performance is presented for a quadrature phase shift keying (QPSK) modulation, rate ½, 1–4 iterations, a block size of 6 bytes (the smallest possible) and a transmission simulated over an additive white Gaussian noise (AWGN) channel. The results correspond to the worst-case scenario, since the test was performed for the smallest block size.
6.2. LTE systems
Figures 11 and 15 show that the decoding latency is reduced in the case of parallel decoding by a factor almost equal to
For serial decoding, the native latency is computed as follows: at the first semi-iterations,
When testing the parallel decoding performance, a certain level of degradation was observed, since the forward and backward metrics are altered at the data block boundaries. In order to obtain a performance similar to the serial decoding case, a small overhead is accepted: by introducing an overlap at each parallel block border, the metric computation gains a training phase. The minimum overlap window length is selected to cover the minimum standard-defined data block (in this case
Figure 17 shows this situation, for the
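The overlapped split can be sketched as follows: each of the N workers reads its K/N-sample slice plus `ov` extra samples on each side, which serve only as a metric warm-up region. The window arithmetic below is an illustrative assumption; the chapter derives the overlap length from the minimum standard-defined block:

```python
# Sketch of the overlapped block split for N parallel SISO workers.
# Each worker emits LLRs only for its own slice [emit_start, emit_end), but
# reads an extra `ov` samples on each side as a training region for the
# forward/backward metrics. Window sizes here are illustrative.

def split_overlapped(K, N, ov):
    windows = []
    for p in range(N):
        lo, hi = p * K // N, (p + 1) * K // N   # the slice this worker outputs
        windows.append((max(0, lo - ov), min(K, hi + ov), lo, hi))
    return windows  # (read_start, read_end, emit_start, emit_end) per worker

print(split_overlapped(6144, 8, 40))
```

The overhead is only 2·ov extra trellis steps per worker, which is small against the K/N-sample slice while restoring near-serial metric quality at the boundaries.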
For even-odd merge sorting network implementation, we can study the configuration
It can be seen in Figure 19 that the sorting unit allows pipelined data processing. Consequently, after a certain implementation delay (7 clock periods in the proposed scheme), the module provides one value of the sorted index set at each clock cycle.
It is important to mention that even-odd merge sorting was selected because it allows pipelined operation while consuming fewer resources than the other listed methods. Comparative results, in terms of resources used for an application-specific integrated circuit (ASIC) implementation, were provided in [11, 27].
In order to evaluate the performance, we used the VHSIC hardware description language (VHDL). The code was tested using ModelSIM 6.5. For the generation of RAM/ROM memory blocks, Xilinx Core Generator 14.7 was employed, and the synthesis was accomplished using Xilinx XST from Xilinx ISE 14.7. Using the above-mentioned tools, the resulting values for the decoding structure implemented on a Xilinx XC5VFX70T-FFG1136 are the following [28]: a frequency of 310 MHz, 664 flip-flops and 568 LUTs for the sorting unit, and a frequency of 300 MHz, 1578 flip-flop registers and 1708 LUTs for the interleaver.
The values listed in Table 2 are obtained using Eqs. (32)–(34), when
| K | Serial | | Parallel | | Parallel with overlap | |
|---|---|---|---|---|---|---|
| | 3 it. | 4 it. | 3 it. | 4 it. | 3 it. | 4 it. |
| 1536 | 88.08 | 117.4 | 11.28 | 15.04 | 15.85 | 21.14 |
| 4096 | 234.3 | 312.5 | 29.57 | 39.42 | 34.14 | 45.52 |
| 6144 | 351.4 | 468.5 | 44.2 | 58.9 | 48.7 | 56.02 |
Table 3 provides the corresponding throughput rate when the values from Table 2 are used.
| K | Serial Rb [Mbps] | | Parallel Rb [Mbps] | | Parallel with overlap Rb [Mbps] | |
|---|---|---|---|---|---|---|
| | 3 it. | 4 it. | 3 it. | 4 it. | 3 it. | 4 it. |
| 1536 | 17.43 | 13.07 | 136.1 | 102.0 | 96.86 | 72.64 |
| 4096 | 17.47 | 13.10 | 138.5 | 103.8 | 119.9 | 89.9 |
| 6144 | 17.48 | 13.11 | 139 | 104.2 | 125.9 | 94.4 |
As one can observe from Table 3, the serial decoding performance is similar to the theoretical one. Let us consider, for example, the case
The following performance graphs were obtained using a finite-precision Matlab simulator. This approach was selected because Matlab produces the same outputs as the ModelSIM simulator, while the testing time is considerably smaller.
All the simulation results were generated for the Max Log MAP algorithm. The illustrations present the bit error rate (BER) versus signal-to-noise ratio (SNR) expressed as the ratio between the energy per bit and the noise power spectral density.
Figure 20 presents the attained performances for the case of
Analyzing the results presented in Figures 20 and 21, one can conclude that the decoding performance obtained with parallel decoding and the overlapped split method is almost identical to that of serial decoding. In contrast, parallel decoding without the overlapped split method generates some loss in performance compared to serial decoding. This degradation depends on the parallelization factor
7. Conclusions
This chapter presented the most important aspects related to the FPGA implementation of a turbo decoder for WiMAX and LTE systems. The serial turbo decoder architectures for the two systems have been developed and efficiently implemented, with important results obtained especially for the proposed interleaver/deinterleaver architectures. For LTE systems, the interleaver memory (ILM) has been introduced; in this manner, the interleaving process effectively works only outside the decoding process itself.
The ILM has been written together with the input data, while the previous block was still under decoding. It should be emphasized that this solution allows an efficient transition from the serial to the parallel decoder, involving only values that are concatenated at the same memory locations. The parallel approach requires the same storage capacity (the same number of BRAMs) and a single interleaver, thus adding only an even-odd merge sorting network. This unique interleaver has been implemented in an efficient configuration that uses only comparators and subtractors, with no multipliers or dividers.
The parallel decoding performance has been compared with the serial one, and a certain degradation has been observed. In order to eliminate this degradation, a small overhead is accepted through the overlapping split applied to the parallel data blocks.
Acknowledgments
This work was supported by the UEFISCDI under Grant PN-II-RU-TE-2014-4-1880.
References
- 1.
C. Berrou, A. Glavieux and P. Thitimajshima, Near Shannon limit error-correcting coding and decoding: Turbo codes, IEEE Proceedings of the International Conference on Communications , Geneva, Switzerland, May 1993, pp. 1064–1070. - 2.
C. Berrou and A. Glavieux, Near optimum error correcting coding and decoding: Turbo-Codes, IEEE Transactions on Communications , vol. 44, no. 10, pp. 1261–1271, Oct. 1996. - 3.
C. Berrou and M. Jézéquel, Non binary convolutional codes for turbo coding, Electronics Letters, vol. 35, no. 1, pp. 39–40, Jan. 1999. - 4.
M. C. Valenti and J. Sun, The UMTS turbo code and an efficient decoder implementation suitable for software-defined radios, International Journal of Wireless Information Networks , vol. 8, no. 4, pp. 203–215, Oct. 2001. - 5.
C. Anghel, A. A. Enescu, C. Paleologu and S. Ciochina, CTC Turbo decoding architecture for H-ARQ capable WiMAX systems implemented on FPGA, Ninth International Conference on Networks ICN 2010, Menuires, France, April 2010. - 6.
C. Anghel, A. A. Enescu, et al., FPGA implementation of a CTC Decoder for H-ARQ compliant WiMAX systems, Proceedings of International Conference on Design & Technology of Integrated Systems , DTIS 2007, Morocco, pp. 82–86. - 7.
C. Anghel, V. Stanciu, C. Stanciu and C. Paleologu, CTC Turbo decoding architecture for LTE systems implemented on FPGA, IARIA ICN 2012 , Reunion, France, 2012. - 8.
S. Chae, A low complexity parallel architecture of turbo decoder based on QPP interleaver for 3GPP-LTE/LTE-A, http://www.design-reuse.com/articles/31907/turbo-decoder-architecture-qpp-interleaver-3gpp-lte-lte-a.html - 9.
Y. Sun and J. R. Cavallaro, Efficient hardware implementation of a highly-parallel 3GPP LTE/ LTE-advance turbo decoder, Integration, the VLSI Journal , vol. 44, no. 4, pp. 305–315, Sept. 2011. - 10.
D. Wu, R. Asghar, Y. Huang and D. Liu, Implementation of a high-speed parallel turbo decoder for 3GPP LTE terminals, ASICON ’09, IEEE 8th International Conference on ASIC , pp. 481–484, 2009. - 11.
C. Studer, C. Benkeser, S. Belfanti and Q. Huang, Design and implementation of a parallel turbo-decoder ASIC for 3GPP-LTE, IEEE Journal of Solid-State Circuits , vol. 46, no. 1, pp. 8–17, Jan. 2011. - 12.
C. Anghel and C. Paleologu, Simplified parallel architecture for LTE-A turbo decoder implemented on FPGA, Proceedings of the 9th International conference on Circuit, Systems, Signal and Telecommunications CCST 2015 , Dubai, pp. 102–111. - 13.
C. Stanciu, C. Anghel and C. Paleologu, Efficient recursive implementation of a quadratic permutation polynomial interleaver for LTE systems, Revue Roumaine, des Sciences Techniques - Serie Électrotechnique et Énergétique , ISSN: 0035-4066, vol. 61, pp. 53–57. - 14.
K. E. Batcher, Sorting networks and their applications, in Proceedings of the AFIPS Spring Joint Computer Conference, vol. 32, 1968. - 15.
C. Anghel, C. Stanciu and C. Paleologu, Sorting methods used in parallel turbo decoding for LTE systems, 2015 International Symposium on Signals, Circuits and Systems (ISSCS) , 9–10 July, 4 p. - 16.
Xilinx Virtex 5 family user guide, https://www.xilinx.com/support/documentation/user_guides/ug190.pdf - 17.
Xilinx ML507 evaluation platform user guide, https://www.xilinx.com/products/boards/ml507/docs.htm - 18.
https://standards.ieee.org/about/get/802/802.16.html - 19.
3GPP TS 36.212 V8.7.0 (2009-05) Technical Specification, “3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Evolved Universal Terrestrial Radio Access (E-UTRA); Multiplexing and channel coding (Release 8).” - 20.
P. Robertson, E. Villebrun and P. Hoeher, A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain, Proceeding of IEEE International Conference on Communications (ICC’95), Seattle, pp. 1009–1013, June 1995. - 21.
S. Papaharalabos, P. Sweeney and B. G. Evans, Constant log-MAP decoding algorithm for duo-binary turbo codes, Electronics Letters , vol. 42, no. 12, pp. 709–710, June 2006. - 22.
J.-F. Cheng and T. Ottosson, Linearly approximated log-MAP algorithms for turbo decoding, Vehicular Technology Conference Proceedings , 2000. VTC 2000-Spring Tokyo. 2000 IEEE 51st vol. 3, pp. 2252–2256, 2000. - 23.
Massachusetts Institute of Technology, Mathematics, last access date: November 2014, math.mit.edu/~shor/18.310/batcher.pdf - 24.
R. Asghar, D. Wu, J. Eilert and D. Liu, Memory conflict analysis and a re-configurable interleaver architecture supporting unified parallel turbo decoding, Journal of Signal Processing Systems , vol. 60, no. 1, pp. 15–19, July 2010. - 25.
S. Wang, L. Liu and Z. Wen, High speed QPP generator with optimized parallel architecture for 4G LTE-A system, International Journal of Advancements in Computing Technology , vol. 4, no. 23, pp. 355–364, July 2010. - 26.
Xilinx, IEEE 802.16e CTC decoder core, DS137 (v2.3), July 11, 2006. - 27.
E. Mumolo, G. Capello and M. Nolich, VHDL design of a scalable VLSI sorting device based on pipelined computation, Journal of Computing and Information Technology - CIT 12, vol. 12, no. 1, pp. 1–14, 2004. - 28.
C. Anghel, C. Stanciu and C. Paleologu, LTE turbo decoding parallel architecture with single interleaver implemented on FPGA, Springer Verlag Circuits, Systems and Signal Processing , ISSN: 0278-081X, DOI 10.1007/s00034-016-0362-z, 2016.