WIMAX has gained a wide popularity due to the growing interest and diffusion of broadband wireless access systems. In order to be flexible and reliable WIMAX adopts several different channel codes, namely convolutional-codes (CC), convolutional-turbo-codes (CTC), block-turbo-codes (BTC) and low-density-parity-check (LDPC) codes, that are able to cope with different channel conditions and application needs.

On the other hand, high-performance digital CMOS technologies have reached such a level of development that very complex algorithms can be implemented in low-cost chips. Moreover, embedded processors, digital signal processors, programmable devices such as FPGAs, application-specific instruction-set processors and VLSI technologies have come to the point where the computing power and the memory required to execute several real-time applications can be incorporated even in cheap portable devices.

Among the several application fields that have been strongly reinforced by this technological progress, channel decoding is one of the most significant and interesting. The design of efficient architectures to implement such channel decoders is known to be a hard task, made harder by the high throughput required by WIMAX systems, which is up to about 75 Mb/s per channel. In particular, CTC and LDPC codes, whose decoding algorithms are iterative, are still a major topic of interest in the scientific literature, and the design of efficient architectures is still fostering several research efforts both in industry and academia.

In this Chapter, the design of VLSI architectures for WIMAX channel decoders will be analyzed with emphasis on three main aspects: performance, complexity and flexibility. The chapter will be divided into two main parts; the first part will deal with the impact of system requirements on the decoder design with emphasis on memory requirements, the structure of the key components of the decoders and the need for parallel architectures. To that purpose a quantitative approach will be adopted to derive from system specifications key architectural choices; most important architectures available in the literature will be also described and compared.

The second part will concentrate on a significant case study: the design of a complete CTC decoder architecture for WIMAX, which also includes hardware units for the depuncturing (bit-deselection) and external deinterleaving (sub-block deinterleaver) functions.

2. From system specifications to architectural choices

The system specifications and in particular the requirement of a peak throughput of about 75 Mb/s per channel imposed by the WIMAX standard have a significant impact on the decoder architecture. In the following sections we analyze the most significant architectures proposed in the literature to implement CC decoders (Viterbi decoders), BTC, CTC and LDPC decoders.

2.1. Viterbi decoders

The most widely used algorithm to decode CCs is the Viterbi algorithm (Viterbi, 1967), which is based on finding the shortest path along a graph that represents the CC trellis. As an example, Fig. 1 shows a binary 4-state CC as a feedback shift register (a), together with the corresponding state diagram (b) and trellis (c) representations.

In the given example, the feedback shift register implementation of the encoder generates two output bits, c1 and c2 for each received information bit, u; c1 is the systematic bit. The state diagram basically is a Mealy finite state machine describing the encoder behaviour in a time independent way: each node corresponds to a valid encoder state, represented by means of the flip flop content, e1 and e2 , while edges are labelled with input and output bits. The trellis representation also provides time information, explicitly showing the evolution from one state to another in different time steps (one single step is drawn in the picture).

At each trellis step n, the Viterbi algorithm associates to each trellis state S a state metric ΓS n that is calculated along the shortest path and stores a decision dS n , which identifies the entering transition on the shortest path. First, the decoder computes the branch metrics (γn ), that are the distances from the metrics labelling each edge on the trellis and the actual received soft symbols. In the case of a binary CC with rate 0.5 the soft symbols are λ1n and λ2n and the branch metrics γn (c2,c1) (see Fig. 2(a)). Starting from these values, the state metrics are updated by selecting the larger metric among the metrics related to each incoming edge of a trellis state and storing the corresponding decision dS n . Finally, decoded bits are obtained by means of a recursive procedure usually referred to as trace-back. In order to estimate the sequence of bits that were encoded for transmission, a state is first selected at the end of the trellis portion to be decoded, then the decoder iteratively goes backward through the state history memory where decisions dS n have been previously stored: this allows one to select, for current state, a new state, which is listed in the state history trace as being the predecessor to that state. Different implementation methods are available to make the initial state choice and to size the portion of trellis where the trace back operation is performed: these methods affect both decoder complexity and error correcting capability. For further details on the algorithm the reader can refer to Viterbi, 1967, [Forney, 1973]. Looking at the global architecture, the main blocks required in a Viterbi decoder are the branch metric unit (BMU) devoted to compute γn , the state metric unit (SMU) to calculate ΓS n and the trace-back unit (TBU) to obtain the decoded sequence. The BMU is made of adders and subtracters to properly combine the input soft symbols (see Fig. 2 (a) ). 
The SMU is based on the so-called add-compare-select (ACS) structure, as shown in Fig. 2 (b). Let i denote the i-th starting state that is connected to an arriving state S by an edge whose branch metric is γi n-1; then ΓS n is calculated as in (1).

$$\Gamma_n^S = \max_i \left\{ \Gamma_{n-1}^i + \gamma_{n-1}^i \right\} \tag{1}$$

As can be inferred from (1), ΓS n is obtained by adding branch metrics to state metrics, then comparing the results and selecting the highest metric, which represents the shortest incoming path. The corresponding decision dS n is stored in a memory that is later read by the TBU to reconstruct the surviving path. Due to the recursive form of (1), as n increases the number of bits needed to represent ΓS n tends to grow. This problem can be solved by normalizing the state metrics at each step. However, this solution requires an additional normalization stage, which increases both the SMU complexity and its critical path. An effective technique, based on two's complement representation, helps to limit the growth of the state metrics, as described in (Hekstra, 1989).
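The add-compare-select update of (1) can be illustrated with a minimal Python sketch. The trellis connectivity (`PREDECESSORS`) and all numeric values are hypothetical and are not taken from Fig. 1; the function only mirrors the structure of the recursion and does not model metric normalization or the wrapping technique.

```python
def acs_step(prev_metrics, predecessors, branch_metrics):
    """One add-compare-select step following Eq. (1).

    prev_metrics[i]        : state metric of state i at step n-1
    predecessors[s]        : states i that have an edge i -> s
    branch_metrics[(i, s)] : branch metric of edge i -> s at step n-1
    Returns (new_metrics, decisions), where decisions[s] is the
    predecessor selected for state s (the value the TBU would read).
    """
    new_metrics, decisions = [], []
    for s, preds in enumerate(predecessors):
        # Add: one candidate path metric per incoming edge.
        candidates = [(prev_metrics[i] + branch_metrics[(i, s)], i) for i in preds]
        # Compare and select: keep the largest candidate.
        best_metric, best_pred = max(candidates)
        new_metrics.append(best_metric)
        decisions.append(best_pred)
    return new_metrics, decisions


# Hypothetical 4-state connectivity, used only for illustration.
PREDECESSORS = [[0, 1], [2, 3], [0, 1], [2, 3]]
BRANCH = {(0, 0): 1, (1, 0): 4, (2, 1): -1, (3, 1): 2,
          (0, 2): -1, (1, 2): -3, (2, 3): 1, (3, 3): -2}

metrics, decisions = acs_step([0, -2, -1, -3], PREDECESSORS, BRANCH)
print(metrics, decisions)  # [2, -1, -1, 0] [1, 3, 0, 2]
```

In a parallel architecture each iteration of the outer loop corresponds to one ACS module, so all states are updated in the same clock cycle.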

The WIMAX standard specifies a binary 64-state CC with rate 0.5, whose shift register representation is shown in Fig. 3. Viterbi decoder architectures usually exploit the intrinsic parallelism of the trellis to compute all the branch metrics and update all the state metrics simultaneously at each trellis step. Thus, if n denotes the number of states of a CC, a parallel architecture employs a BMU and n ACS modules. Moreover, to reduce the decoding latency, the trace-back is performed as a sliding-window process (Radar, 1981) on portions of the trellis of width W. This approach reduces not only the latency but also the size of the decision memory, which usually requires 3W or 4W cells depending on the TBU radix (Black & Meng, 1992).

To improve the decoder throughput, two Black & Meng, 1992 or more Fettweis & Meyr, 1989, Kong & Parhi, 2004, Cheng & Parhi, 2008 trellis steps can be processed concurrently. These solutions lead to the so called higher radix or M-look-ahead step architectures. According to Kong & Parhi, 2004, the throughput sustained by an M-look-ahead step architecture, defined as the number of decoded bits over the decoding time is

$$T = \frac{k \cdot N_T \cdot f_{clk}}{N_T/M + W} \approx f_{clk} \cdot M \cdot k \tag{2}$$

where fclk is the clock frequency, NT is the number of trellis steps, k=1 for a binary CC, k=2 for a double binary CC and the right most expression is obtained under the condition W<< NT that is a reasonable assumption in real cases.

Thus, to achieve the throughput required by the WIMAX standard with a clock frequency limited to tens to few thousands of MHz, M=1 (radix-2) or M=2 (radix-4) is a reasonable choice.
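As a rough numerical check of (2), the following Python sketch evaluates the throughput of an M-look-ahead architecture and the clock frequency needed to reach a target throughput under the approximation W<<NT. The trellis length and window width in the example are arbitrary illustrative values, not figures from the standard.

```python
def viterbi_throughput(f_clk, n_t, m, w, k=1):
    """Throughput of an M-look-ahead Viterbi decoder, exact form of Eq. (2)."""
    return k * n_t * f_clk / (n_t / m + w)


def required_clock(target_bps, m, k=1):
    """Clock frequency from the approximate form of Eq. (2): T ~ f_clk * M * k."""
    return target_bps / (m * k)


for m in (1, 2):
    f = required_clock(75e6, m)
    # Illustrative trellis length and window width (assumed values).
    t = viterbi_throughput(f, n_t=2000, m=m, w=64)
    print(f"M={m}: f_clk ~ {f / 1e6:.1f} MHz, exact throughput {t / 1e6:.1f} Mb/s")
```

With these assumptions a radix-2 decoder needs about 75 MHz and a radix-4 decoder about 37.5 MHz; the exact form gives a slightly lower throughput because of the W term in the denominator.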

However, since CCs are widely used in many communication systems, some recent works, such as (Batcha & Shameri, 2007) and (Kamuf et al., 2008), address the design of flexible Viterbi decoders that are able to support different CCs. As a further step, (Vogt & Wehn, 2008) proposed a multi-code decoder architecture able to support both CCs and CTCs.

2.2. BTC decoders

Block Turbo Codes, or product codes, are serially concatenated block codes. Given two block codes C1=(n1,k1,δ1) and C2=(n2,k2,δ2), where ni , ki and δi represent the code-word length, the number of information bits and the minimum Hamming distance, respectively, the corresponding product code is obtained according to (Pyndiah, 1998) as an array with k1 rows and k2 columns containing the information bits. Coding is then performed on the k1 rows with C2 and on the n2 resulting columns with C1 . BTC codes can be decoded iteratively, row-wise and column-wise, by using the sub-optimal algorithm detailed in (Pyndiah, 1998). The basic idea relies on using the Chase search (Chase, 1972), a near-maximum-likelihood (near-ML) searching strategy, to find a list of code-words and an ML decided code-word d={d0,…, dn-1} with dj ∈ {-1,+1}. According to the notation used in [Vanstraceele et al., 2008], decision reliabilities are computed as

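$$\lambda(d_j) \approx \frac{\left|\underline{r} - \underline{c}^{-1(j)}\right|^2 - \left|\underline{r} - \underline{c}^{+1(j)}\right|^2}{4} \tag{3}$$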

where r={r0,…rn-1} is the received code-word and c-1(j) and c+1(j) are the code-words in the Chase list at minimum Euclidean distance from r such that the j-th bit of the code-word is -1 and +1 respectively. Then one decoder sends to the other the extrinsic information

$$w_j^{out} = \lambda(d_j) - r_j \tag{4}$$

If the Chase search fails the extrinsic information is approximated as

$$w_j^{out} = \beta \cdot d_j \tag{5}$$

where β is a weight factor increasing with the number of iterations.

The decoder that receives the extrinsic information uses an updated version of r obtained as

$$r_j^{new} = r_j^{old} + \alpha \cdot w_j^{in} \tag{6}$$

where α is a weight factor increasing with the number of iterations. A scheme of the elementary block turbo decoder is shown in Fig. 4, where the block named “decoder” is a Soft-In-Soft-Out (SISO) module that performs the Chase search and implements (3), (4) and (5). An effective solution to implement the SISO module is based on a three-stage pipelined architecture, where the three stages are identified as the reception, processing and transmission units (Kerouedan & Adde, 2000). As detailed in (LeBidan et al., 2008), during each stage the N soft values of the received word r are processed sequentially in N clock periods. The reception stage is devoted to finding the least reliable bits in the received code-word. The processing stage performs the Chase search, and the transmission stage calculates λ(dj), wj and rj new . Another solution is proposed in (Goubier et al. 2008), where the elementary decoder is implemented as a pipeline resorting to the mini-maxi algorithm, namely by using mini-maxi arrays to store the best metrics of all decoded code-words in the Chase list.
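The computations performed by the transmission stage, namely (3), (4), (5) and (6), can be summarized by the following Python sketch. It assumes that the Chase search has already produced the two competing code-words; the function names and the numeric values are illustrative only, and the Chase search itself is not modelled.

```python
def squared_distance(r, c):
    """Squared Euclidean distance between received word r and code-word c."""
    return sum((ri - ci) ** 2 for ri, ci in zip(r, c))


def reliability(r, c_minus, c_plus):
    """Eq. (3): c_minus / c_plus are the closest code-words with bit j = -1 / +1."""
    return (squared_distance(r, c_minus) - squared_distance(r, c_plus)) / 4.0


def extrinsic(r_j, lam_j=None, d_j=None, beta=0.0):
    """Eq. (4) when a reliability is available, Eq. (5) when the search fails."""
    if lam_j is not None:
        return lam_j - r_j
    return beta * d_j


def update_input(r_old_j, w_in_j, alpha):
    """Eq. (6): soft input used by the next elementary decoder."""
    return r_old_j + alpha * w_in_j


r = [0.5, -1.0, 0.25]                  # received soft values (illustrative)
lam0 = reliability(r, c_minus=[-1, -1, -1], c_plus=[1, -1, 1])
w0 = extrinsic(r[0], lam_j=lam0)
print(lam0, w0, update_input(r[0], w0, alpha=0.5))  # 0.75 0.25 0.625
```

When no competing code-word is found for bit j, calling `extrinsic(r_j, d_j=d_j, beta=beta)` applies the fallback of (5).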

Several works in the literature deal with BTC complexity reduction. As an example, (Adde & Pyndiah, 2000) suggests computing β in (5) on a per-code-word basis, whereas in (Chi et al., 2004) the dependency on α in (6) is removed by replacing the term α∙wj with tanh(wj/2). In (Le et al. 2005) both α in (6) and β in (5) are avoided by exploiting the Euclidean distance property.

Due to its row-column structure, the block turbo decoder can be parallelized by instantiating several elementary decoders to concurrently process more rows or columns, thus increasing the throughput. As a significant example in [Jego et al., 2006] a fully parallel BTC decoder is proposed. This solution instantiates n1 +n2 decoders that work concurrently. Moreover, by properly managing the scheduling of the decoders and interconnecting them through an Omega network intermediate results (row decoded data or column decoded data) are not stored.

A detailed analysis of throughput and complexity of BTC decoder architectures can be found in Goubier et al. 2008 and LeBidan et al., 2008. In particular, according to Goubier et al. 2008 a simple one block decoder architecture that performs the row/column decoding sequentially (interleaved architecture) requires 2(n1 +n2 ) cycles to complete an iteration; as a consequence it achieves a throughput

$$T = \frac{k_1 \cdot k_2 \cdot f_{clk}}{I \cdot 2(n_1 + n_2)} \tag{7}$$

where I is the number of iterations and fclk is the clock frequency. The BTC specified for WIMAX is obtained by using twice a binary extended Hamming code chosen among the ones shown in Table 1.

| n  | k  |
|----|----|
| 15 | 11 |
| 31 | 26 |
| 63 | 57 |

Table 1. WIMAX binary extended Hamming codes (H(n,k)) used for BTC.

Considering the interleaved architecture described in Goubier et al. 2008 where a fully decoded block is output every 4.5 half iterations, we obtain that 75 Mb/s can be obtained with a clock frequency of 84 MHz, 31 MHz and 14 MHz for H(15,11), H(31,26) and H(63,57) respectively.
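These clock frequencies follow from solving (7) for fclk with the same code used for rows and columns and I=4.5/2 full iterations. The short Python sketch below reproduces the arithmetic; rounding up to the next integer number of MHz is an assumption made here to match the quoted figures.

```python
import math


def btc_clock_hz(target_bps, n1, k1, n2, k2, iterations):
    """Clock frequency obtained by solving Eq. (7) for f_clk."""
    return target_bps * iterations * 2 * (n1 + n2) / (k1 * k2)


CODES = [(15, 11), (31, 26), (63, 57)]   # (n, k) pairs from Table 1
for n, k in CODES:
    # 4.5 half iterations correspond to 2.25 full iterations.
    f = btc_clock_hz(75e6, n, k, n, k, iterations=4.5 / 2)
    print(f"H({n},{k}): {math.ceil(f / 1e6)} MHz")
# H(15,11): 84 MHz
# H(31,26): 31 MHz
# H(63,57): 14 MHz
```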

2.3. CTC decoders

Convolutional turbo codes were proposed in 1993 by Berrou, Glavieux and Thitimajshima Berrou et al., 1993 as a coding scheme based on the parallel concatenation of two CCs by the means of an interleaver (Π) as shown in Fig. 5 (a). The decoding algorithm is iterative and is based on the BCJR algorithm Bahl et al., 1974 applied on the trellis representation of each constituent CC Fig. 5 (b). The key idea relies on the fact that the extrinsic information output by one CC is used as an updated version of the input a-priori information by the other CC. As a consequence, each iteration is made of two half iterations, in one half iteration the data are processed according to the interleaver (Π) and in the other half iteration according to the deinterleaver (Π^{-1}). The same result can be obtained by implementing an in-order read/write half iteration and a scrambled (interleaved) read/write half iteration. The basic block in a turbo decoder is a SISO module that implements the BCJR algorithm in its logarithmic likelihood ratio (LLR) form. If we consider a Recursive Systematic CC (RSC code), the extrinsic information λ_{k}(u;O) of an uncoded symbol u at trellis step k output by a SISO is

where ũ is an uncoded symbol taken as a reference (usually ũ=0), e represents a certain transition on the trellis and u(e) is the uncoded symbol u associated to e. The max^{*} function is usually implemented as a max followed by a correction term Robertson et al., 1995, Gross & Gulak, 1998, Cheng & Ottosson, 2000, Classon et al., 2002, Wang et al., 2006, Talakoub et al. 2007. A scaling factor can also be applied to further improve the max or max^{*} approximation Vogt & Finger, 2000. The correction term, usually adopted when decoding binary codes, can be omitted for double binary turbo codes Berrou et al. 2001 with minor error rate performance degradation. The term b(e) in (8) is defined as
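As a small illustration of the "max followed by a correction term" structure, the Python sketch below uses the common definition of max^{*} as the Jacobian logarithm, max*(a,b)=ln(e^a+e^b). The exact correction term is computed here with `math.log1p`; hardware implementations typically replace it with a small look-up table or drop it altogether, which is not modelled.

```python
import math


def max_star(a, b):
    """max* operator: max plus the correction term ln(1 + exp(-|a - b|))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a, b):
    """Max-log approximation: the correction term is simply omitted."""
    return max(a, b)


a, b = 1.0, 2.0
exact = math.log(math.exp(a) + math.exp(b))
print(max_star(a, b) - exact)   # difference is at rounding-error level
print(exact - max_log(a, b))    # approximation error, at most ln(2)
```

The correction term is largest (ln 2) when the two operands are equal and vanishes as they move apart, which is why omitting it costs little when one path clearly dominates.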

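$$b(e) = \alpha_{k-1}[s^S(e)] + \gamma_k[e] + \beta_k[s^E(e)] \tag{9}$$

$$\alpha_k[s] = \max_{e:\, s^E(e)=s} \left\{ \alpha_{k-1}[s^S(e)] + \gamma_k[e] \right\} \tag{10}$$

$$\beta_k[s] = \max_{e:\, s^S(e)=s} \left\{ \beta_{k+1}[s^E(e)] + \gamma_k[e] \right\} \tag{11}$$

$$\gamma_k[e] = \pi_k[u(e);I] + \pi_k[c(e);I] \tag{12}$$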

where sS(e) and sE(e) are the starting and the ending states of e, αk-1[sS(e)] and βk[sE(e)] are the forward and backward state metrics associated to sS(e) and sE(e) respectively (see Fig. 5 (b)) and γk[e] is the branch metric associated to e. The πk[c(e);I] term is computed as a weighted sum of the λk[c;I] produced by the soft demodulator as

$$\pi_k[c(e);I] = \sum_{i=1}^{n_c} c_i(e)\, \lambda_k[c_i(e);I] \tag{13}$$

where ci(e) is one of the coded bits associated to e and nc is the number of bits forming a coded symbol c and πk[cu(e);I] in (8) is obtained as πk[c(e); I] considering only the systematic bits corresponding to the uncoded symbol u out of the nc coded bits. The πk[u(e);I] term is obtained combining the input a-priori information λk(u;I) and for a double binary code can be written as in (14), where A and B represent the two bits forming an uncoded symbol u.

The CTC specified in the WIMAX standard is based on a double binary 8-state constituent CC as shown in Fig. 6, where each CC receives two uncoded bits (A, B) and produces four coded bits, two systematic bits (A,B) and two parity bits (Y,W). As a consequence, at each trellis step four transitions connect a starting state to four possible ending states. Due to the trellis symmetry only 16 branch metrics out of the possible 32 branch metrics are required at each trellis step. As pointed out in Muller et al. 2006 high throughput can be achieved by exploiting the trellis parallelism, namely computing concurrently all the branch and state metrics.

The 16 branch metrics are computed by a BMU that implements (12) as shown in Fig. 7. To reduce the latency of the SISO, the decoding is usually based on a sliding-window approach (Benedetto et al., 1996). As a consequence, at least two BMUs are required to compute the two recursions (forward and backward) according to the BCJR algorithm. However, since β metrics need to be trained between successive windows, a further BMU is usually required. A solution based on the inheritance of the border metrics of each window (Abbasfar & Yao 2003) requires only two BMUs. Furthermore, this strategy reduces the SISO latency to the sliding window width W. The state metrics are updated according to (10) and (11) by two state metric processors, each of which is made of a proper number of processing elements (PE). As shown in Fig. 7, 8 PEs are required for the WIMAX CTC. It is worth pointing out that the constituent codes of the WIMAX CTC use the circulation state tailbiting strategy proposed in (Weiss et al. 2001), which ensures that the ending state of the last trellis step is equal to the starting state of the first trellis step. However, this technique requires estimating the circulation state at the decoder side. Since training operations to estimate the circulation state would increase the SISO latency, an effective alternative (Zhan et al. 2006) is to inherit these metrics from the previous iteration.

As in Viterbi decoder architectures, in CTC decoders the state metrics are often computed by means of the “wrapping” representation technique proposed in (Hekstra, 1989). This solution requires a normalization stage, depicted in Fig. 7, when combining α, β and γ metrics to compute the extrinsic information as in (8). The last stage of the output processor, which computes the output extrinsic information, is made of a tree of max blocks for each component of the extrinsic information and a few adders to implement (8). As highlighted in Fig. 7, this scheduling requires a buffer to store the input LLRs that are used to compute the backward recursion (BMU-MEM). Since the output extrinsic information is computed during the backward recursion, forward recursion metrics are stored in a buffer (α-MEM). Further memory is required to implement the border metric inheritance: α-EXT-MEM, β-EXT-MEM and β-LOC-MEM.

The throughput sustained by the CTC decoder, defined as the number of decoded bits over the time required for their computation, is

where fclk is the clock frequency, NT is the number of trellis steps, k=1 for a binary CTC, k=2 for a double binary CTC, 2I is the number of half iterations, Ncyc SISO and Ncyc ID represent the number of clock cycles required by one SISO and by the interleaving/deinterleaving structure. Since both Ncyc SISO and Ncyc ID are a function of NT they can be rewritten as Ncyc SISO=NT ∙SP+SISOcyc lat and Ncyc ID =NT ∙SP+IDcyc oh where SP is the sending period, namely the rate sustained by the decoder to output two consecutive valid output data (SP=1 means at each clock cycle new valid output data are ready), SISOcyc lat is the decoder latency, namely the number of clock cycles spent to produce the first valid output data, and IDcyc oh is the interleaver/deinterleaver architecture overhead expressed in clock cycles. Usually, resorting to pipelining, Ncyc SISO and Ncyc ID can be partially overlapped; thus, the number of cycles required by one SISO decoder is Ncyc dec=NT ∙SP+SISOcyc lat +IDcyc oh . Using the sliding window technique with the border metric inheritance strategy Abbasfar & Yao 2003, Zhan et al. 2006 we obtain SISOcyc lat≈SP∙W and so (15) can be rewritten as (16), where the rightmost expression is obtained considering W<<NT and IDcyc oh <<SP∙NT that is a reasonable assumption in real cases.

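$$T = \frac{k \cdot N_T \cdot f_{clk}}{2I \cdot \left[ SP \cdot (N_T + W) + ID_{oh}^{cyc} \right]} \approx \frac{k \cdot f_{clk}}{2I \cdot SP} \tag{16}$$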

Optimized architectures (Masera et al., 1999), (Bickerstaff et al., 2003), (Kim & Park, 2008) are usually obtained with SP=1, whereas flexible architectures have higher SP values (Vogt & Wehn, 2008), (Muller et al., 2009). However, even with SP=1, a double binary turbo decoder architecture that achieves the throughput imposed by WIMAX with eight iterations (I=8) would require fclk =600 MHz. A possible solution, which improves the throughput by a factor in the range [1.2, 1.9], is based on decoder-level parallelism (Muller et al. 2006) and is usually referred to as “shuffling” (Zhang & Fossorier, 2005). However, to further improve the throughput, a parallel decoder made of P SISOs working concurrently is required. As a consequence, a parallel architecture achieves a throughput

Thus, setting P=4, I=8 and SP=1, the WIMAX throughput is obtained with fclk =150 MHz. It is worth pointing out that a P-parallel CTC decoder is made of P SISOs connected to P memories devoted to store the extrinsic information. However, in a parallel decoder during the scrambled half iteration collisions can occur, namely more SISOs could need to access the same memory during the same cycle. Since the collision phenomenon increases IDcyc oh , several algorithmic approaches to design collision free interleavers Giulietti et al. 2002, Kwak & Lee, 2002, Gnaedig et al., 2003, Tarable et al., 2004 have been proposed. On the other hand, architectures to manage collisions in a parallel turbo decoder have also been proposed in the literature Thul et al., 2002, Gilbert et al., 2003, Thul et al., 2003, Speziali & Zory, 2004, Martina et al. 2008-a, Martina et al., 2008-b, in particular Martina et al. 2008-b deals with the parallelization of the WIMAX CTC interleaver and avoids collision by the means of a throughput/parallelism scalable architecture that features IDcyc oh =0.
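The clock frequencies quoted for the serial and the parallel decoder can be checked with the approximate form of (16). The Python sketch below assumes that the throughput of a P-parallel decoder scales linearly with P, which is consistent with the 600 MHz and 150 MHz figures given in the text; overheads such as IDcyc oh and the window latency are neglected.

```python
def ctc_clock_hz(target_bps, iterations, sp=1, k=2, p=1):
    """Clock frequency from T ~ P * k * f_clk / (2 * I * SP).

    The linear dependence on p (number of parallel SISOs) is an assumption
    of this sketch; k=2 corresponds to a double binary code.
    """
    return target_bps * 2 * iterations * sp / (k * p)


print(ctc_clock_hz(75e6, iterations=8) / 1e6)        # 600.0 (single SISO)
print(ctc_clock_hz(75e6, iterations=8, p=4) / 1e6)   # 150.0 (four SISOs)
```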

It is worth pointing out that parallel architectures increase not only the throughput but also the complexity of the decoder, so that some recent works aim at reducing the amount of memory required to implement SISO local buffers. In Liu et al., 2007 and Kim & Park, 2008 saturation of forward state metrics and quantization of border backward state metrics is proposed. Further studies have been performed to reduce the extrinsic information bit width by using adaptive quantization Singh et al., 2008, pseudo-floating point representation Park et al., 2008 and bit level representation Kim & Park, 2009.

2.4. LDPC code decoders

LDPC codes were originally introduced in 1962 by Gallager (Gallager, 1962) and rediscovered in 1996 by MacKay and Neal [MacKay, 1996]. Like turbo codes, they achieve near-optimum error correction performance and are decoded by means of high-complexity iterative algorithms.

An LDPC code is a linear block code defined by a C×B parity check matrix H, characterized by a low density of ones: B is the number of bits in the code (block length), while C is the number of parity checks. A one in a given cell of the H matrix indicates that the bit corresponding to the cell column is used for the calculation of the parity check associated to the row. A popular description of an LDPC code is the bipartite (or Tanner) graph shown in Figure 8 for a small example, where B variable nodes (VN) are connected to C check nodes (CN) through edges corresponding to the positions of the ones in H.
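The relation between H and the Tanner graph can be made concrete with a few lines of Python. The 3×6 matrix below is a hypothetical toy example (it is neither the matrix of Figure 8 nor a WIMAX matrix, and it is not sparse enough to be a real LDPC code); it only shows how the edges of the graph are read from the positions of the ones.

```python
# Hypothetical toy parity check matrix: C = 3 checks, B = 6 bits.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]


def tanner_graph(h):
    """Return the adjacency lists of the Tanner graph of h.

    checks_of_vn[j] lists the check nodes connected to variable node j,
    vns_of_check[i] lists the variable nodes connected to check node i.
    """
    checks_of_vn = [[i for i, row in enumerate(h) if row[j]]
                    for j in range(len(h[0]))]
    vns_of_check = [[j for j, bit in enumerate(row) if bit] for row in h]
    return checks_of_vn, vns_of_check


def syndrome(h, word):
    """Evaluate every parity check over GF(2); all zeros means a valid code-word."""
    return [sum(b & x for b, x in zip(row, word)) % 2 for row in h]


checks_of_vn, vns_of_check = tanner_graph(H)
print(checks_of_vn[0], vns_of_check[1])   # [0, 2] [1, 2, 4]
print(syndrome(H, [1, 1, 0, 0, 1, 1]))    # [0, 0, 0]
```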

LDPC codes are usually decoded by means of an iterative algorithm variously known as sum-product, belief propagation or message passing, and reformulated in a version that processes logarithmic likelihood ratios instead of probabilities. In the first half of each iteration, variable nodes receive data from adjacent check nodes and from the channel and use them to obtain updated information, which is sent to the check nodes; in the second half, check nodes take the updated information received from the connected bit nodes and generate new messages to be sent back to the variable nodes.

In message passing decoders, messages are exchanged along the edges of the Tanner graph, and computations are performed at the nodes. To avoid multiplications and divisions, the decoder usually works in the logarithmic domain.

The message passing algorithm is described in the following equations, where k represents the current iteration, Qji is the message generated by VN j and directed to CN i, Rij is the message computed by CN i and sent to VN j. C[j] is the whole set of incoming messages for VN j and R[i] is the whole set of the incoming messages for CN i.

Each variable node is initialized with the log-likelihood ratio (LLR) λj associated to the received bit. Next, messages are propagated from the variable nodes to the check nodes along the edges of the Tanner graph. At the first iteration only the λj values are delivered, while starting from the second iteration VNs sum up all the messages Rij coming from CNs and combine them with λj according to

$$Q_{ji}^{k} = \lambda_j + \sum_{\alpha \in C[j] \setminus i} R_{\alpha j}^{k-1} \tag{18}$$

The check node computes new check to variable messages as

where |R[j]| is the cardinality of the CN and Ψ(x) is a non-linear function defined as

$$\Psi(x) = -\ln\left( \tanh\left| \frac{x}{2} \right| \right) \tag{20}$$

After a number of iterations that strongly depends on the addressed application and code rate (typically 5 to 40), variable nodes compute an overall estimation of the decoded bit in the form

$$\Lambda_j^{k} = \lambda_j + \sum_{\alpha \in C[j]} R_{\alpha j}^{k-1} \tag{21}$$

where the sign of Λj can be understood as the hard decision on the decoded bit.
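The variable node computations of (18) and (21) share most of their work: the total sum of (21) can be computed once, and each outgoing message of (18) is obtained by subtracting the message that arrived on the same edge. The Python sketch below illustrates this; the function name, the dictionary-based interface and the values are illustrative assumptions.

```python
def variable_node_update(llr_j, r_in):
    """Variable node update for VN j.

    llr_j : channel LLR of bit j (lambda_j)
    r_in  : dict mapping check index i -> R_ij from the previous iteration
    Returns (q_out, total), where q_out[i] is the message of Eq. (18) sent
    to check i and total is the overall estimate of Eq. (21).
    """
    total = llr_j + sum(r_in.values())                 # Eq. (21)
    q_out = {i: total - r for i, r in r_in.items()}    # Eq. (18)
    return q_out, total


q, total = variable_node_update(1.5, {0: -0.5, 2: 2.0, 5: 0.25})
print(q, total)  # {0: 3.75, 2: 1.25, 5: 3.0} 3.25
```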

A large implementation complexity is associated to (19), which is simplified in different ways. First of all, the function Ψ(x) can be obtained by means of reduced-complexity estimations (Masera et al., 2005). Moreover, sub-optimal, low-complexity algorithms have been successfully proposed to simplify (19), such as the normalized Min-Sum algorithm (Chen et al., 2005), where only the two smallest magnitudes are used.
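A minimal Python sketch of a normalized Min-Sum check node is given below to show why only the two smallest magnitudes are needed. The scaling factor 0.75 is an assumed, commonly used value and is not taken from the text; the function expects at least two incoming messages.

```python
def min_sum_check_node(q_in, scale=0.75):
    """Normalized Min-Sum check node update (scale is an assumed value).

    q_in : list of messages Q entering the check node, one per edge.
    Returns one outgoing message R per edge, computed from all other edges.
    """
    mags = [abs(q) for q in q_in]
    # Only the two smallest magnitudes (and the position of the smallest) matter.
    order = sorted(range(len(mags)), key=mags.__getitem__)
    min1_idx, min1, min2 = order[0], mags[order[0]], mags[order[1]]
    # Product of all input signs.
    sign_all = 1
    for q in q_in:
        if q < 0:
            sign_all = -sign_all
    out = []
    for idx, q in enumerate(q_in):
        # Exclude the edge's own contribution: its magnitude and its sign.
        mag = min2 if idx == min1_idx else min1
        sign = sign_all * (-1 if q < 0 else 1)
        out.append(scale * sign * mag)
    return out


print(min_sum_check_node([2.0, -1.0, 4.0, -3.0]))  # [0.75, -1.5, 0.75, -0.75]
```

Every edge receives the smallest magnitude, except the edge that supplied it, which receives the second smallest; this is what allows the check node to store two values instead of evaluating Ψ(x) on every input.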

A further change is usually applied to the scheduling of variable and check nodes in order to improve communications performance. In the two-phase scheduling, the updating of variable and check nodes is accomplished in two separate phases. On the contrary, the turbo decoding message passing (TDMP) Mansour & Shanbhag, 2003, also known as layered or shuffled decoding, allows for overlapped update operations: messages calculated by a subset of check nodes are immediately used to update variable nodes. This scheduling has been proved to be able to reduce the number of iterations by up to 50% at a fixed communications performance.

The required number of functional units in a decoder can be estimated based on the concept of processing power Pc (Gouillod et al., 2007), which can be evaluated on the basis of the rate Rc of the code, the number K of information bits transmitted per codeword, the block size N=K/Rc , the required information throughput D, the operating clock frequency fclk , the maximum number of iterations iMAX and the total number of edges to be processed per iteration ε. This relation is expressed as

$$P_c = \frac{\varepsilon \cdot D \cdot i_{MAX}}{K \cdot f_{clk}} \tag{22}$$

As two messages are associated with each edge (to be sent from the CN to the VN and vice versa), 2Pc gives the number of messages that must be concurrently processed at each decoding iteration in order to achieve the target throughput D. Equation (22) does not consider the message exchange overhead: yet it assumes that all messages dispatched during a cycle are delivered simultaneously during the same cycle. The Pc value must then be assumed as a lower bound and the actual degree of parallelism strongly depends on both the structure of the H matrix Dinoi et al., 2006 and the adopted interconnect architecture among processing units Quaglio et al., 2006, Masera et al., 2007.
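To give a feeling for the orders of magnitude involved in (22), the Python sketch below evaluates Pc for one set of parameters. All the numbers (edge count, codeword size, iteration limit and clock frequency) are illustrative assumptions chosen for the example and are not quoted from the standard or from the cited works.

```python
def processing_power(edges, throughput_bps, max_iter, info_bits, f_clk_hz):
    """Eq. (22): number of edges to be processed per clock cycle."""
    return edges * throughput_bps * max_iter / (info_bits * f_clk_hz)


# Illustrative values only (assumptions of this sketch).
pc = processing_power(edges=7296, throughput_bps=75e6, max_iter=10,
                      info_bits=1152, f_clk_hz=200e6)
print(pc, 2 * pc)  # 23.75 47.5
```

Under these assumptions about 24 edges, that is roughly 48 messages, would have to be handled per cycle, before accounting for any message exchange overhead.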

In practice, most of the implementation concerns come from the communication structure that must be allocated to support message passing from bit to check nodes and vice versa. Several hardware realizations proposed in the literature focus on how to pass messages efficiently between the two types of processing units.

Three approaches can be followed in the high level organization of the decoder, coming to three kinds of architectures.

-Serial architectures: bit and check processors are allocated as single instances, each serving multiple nodes sequentially; messages are exchanged by means of a memory.

-Fully parallel architectures: processing units are allocated for each single bit and check node and all messages are passed in parallel on dedicated routes.

-Partially parallel architectures: more processing units work in parallel, serving all bit and check nodes within a number of cycles; suitable organization and hardware support is required to exchange messages.

For most codes and applications, the first approach results in slow implementations, while the second one has an excessive cost. As a result the only general viable solution is the third partially parallel approach, which on the other hand introduces the collision problem, already known in the implementation of parallel turbo decoders. Two main approaches have been proposed to deal with collisions:

Even if the first approach has proven to be effective, it significantly limits the supported code classes. The second approach, on the other hand, is well suited for flexible and general architectures. An even more challenging task is the design of LDPC decoders that are flexible in terms of supported block sizes and code rates Masera et al., 2007.

In partially parallel structures, permutation networks are used to establish the correct connections between functional units. However, structured LDPC codes, such as those specified in WIMAX, allow for replacing permutation networks by low complexity barrel shifters Boutillon et al., 2000, Mansour & Shanbhag, 2003.

Early termination schemes can be adopted to improve the decoding efficiency by dynamically adjusting the number of iterations according to the SNR values. The simplest approach requires that decoding decisions are stored and compared across two consecutive iterations: if no changes are detected, the decoding is terminated, otherwise it is continued up to a maximum number of iterations. More sophisticated iteration control schemes are able to reduce the mean number of iterations, thus saving both latency and energy (Kienle & Wehn, 2005), (Shin et al., 2007).
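The simplest stopping rule described above can be sketched in Python as follows. The `iterate` callable is a hypothetical placeholder for one full decoding iteration returning the updated decoder state and the current hard decisions; it is not part of any specific decoder described in this chapter.

```python
def decode_with_early_stop(iterate, state, max_iter):
    """Run decoding iterations until the hard decisions stop changing.

    iterate  : hypothetical callable, state -> (new_state, hard_decisions)
    Returns (hard_decisions, iterations_used).
    """
    previous = None
    for it in range(1, max_iter + 1):
        state, decisions = iterate(state)
        if decisions == previous:
            # No change across two consecutive iterations: terminate.
            return decisions, it
        previous = decisions
    return previous, max_iter


def toy_iteration(state):
    """Toy stand-in for a decoder whose decisions settle after three iterations."""
    state += 1
    return state, [min(state, 3) % 2, 1]


print(decode_with_early_stop(toy_iteration, 0, max_iter=10))  # ([1, 1], 4)
```

Note that this rule always spends one extra iteration to confirm that the decisions are stable, which is part of the motivation for the more sophisticated schemes.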

3. Case study: complete WIMAX CTC decoder design

The WIMAX CTC decoder is made of three main blocks: symbol deselection (SD), subblock deinterleaver and CTC decoder, as highlighted in Fig. 9, where N represents the number of couples included in a data frame. The SD, subblock deinterleaver and CTC decoder blocks are connected by means of memory buffers, which guarantee that the non-iterative part of the decoder (namely SD and subblock deinterleaver) and the decoding loop work simultaneously on consecutive data frames. Since the maximum decoder throughput is about 75 Mb/s and the native CTC rate is 1/3 (two uncoded bits produce six coded bits), the maximum throughput at the input of the decoding loop can rise to 225 million LLRs per second. The same throughput has to be sustained by the subblock deinterleaver, whereas an even higher throughput has to be sustained by the SD unit in case of repetition.

3.1. Symbol deselection

Depending on the amount of data sent by the encoder (puncturing or repetition), the throughput sustained by the symbol deselection (SD) can rise to 900 million LLRs per second (repetition 4). When the encoder performs repetition, the same symbol is sent more than once. Thus, the decoder combines the LLRs referred to the same symbol to improve the reliability of that symbol. As shown in Fig. 9, this can be achieved by partitioning the symbol deselection input buffer into four memories, each containing up to 6N LLRs.
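One common way to combine repeated observations is to add the LLRs that refer to the same symbol. The combining rule is not stated here, so summation is an assumption; the sketch also assumes that the received stream wraps around a circular buffer whose size would be 6N in this decoder. The function name and arguments are illustrative.

```python
from typing import List, Sequence


def combine_repeated_llrs(
    received: Sequence[float], buffer_size: int, start: int = 0
) -> List[float]:
    """Accumulate received LLRs into a circular buffer of buffer_size entries.

    Positions that are never written (punctured bits) stay at zero;
    positions written more than once (repetition) hold the sum of their LLRs.
    """
    combined = [0.0] * buffer_size
    for offset, llr in enumerate(received):
        combined[(start + offset) % buffer_size] += llr
    return combined
```

Initialising every location to zero also covers puncturing, because a zero LLR carries no information about the corresponding bit.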

Since the symbol deselection architecture can read up to four LLRs per clock cycle, it reduces the incoming throughput to 225 million LLRs per second. However, the symbol deselection has to compute the starting location and the number of LLRs to be written into the output buffer. The number of LLRs and the starting location are obtained as in (23) and (24) respectively, where $N_{SCHk}$, $m_k$ and $SPID_k$ are parameters specified by the WIMAX standard for the k-th subpacket when HARQ is enabled: $N_{SCHk}$ is the number of concatenated slots, $m_k$ is the modulation order and $SPID_k$ is the subpacket ID.

$$L_k = 48 \cdot m_k \cdot N_{SCHk} \tag{23}$$

$$F_k = (SPID_k \cdot L_k) \bmod 6N \tag{24}$$

Since $N_{SCHk} \in [1, 480]$ and $m_k \in \{2, 4, 6\}$, we can rewrite (23) as (25).

An efficient implementation of (25) is obtained with an adder whose inputs are $N_{SCHk}$ and a selection between two hardwired left-shifted versions of $N_{SCHk}$ (one position and three positions), followed by a programmable left shifter (five or six positions). Similarly, since $SPID_k \in \{0, 1, 2, 3\}$, the multiplication in (24) is avoided, as expressed by (26).

A block scheme of the architecture employed to compute $F_k$ and $L_k$ is depicted in Fig. 10 (a).
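The rewritten expressions are not reproduced in this text, but the shift-and-add structure described above can be checked numerically. The sketch below is one decomposition consistent with that description (48·m_k equals 3·2^5, 3·2^6 and 9·2^5 for m_k = 2, 4 and 6); it is an assumption rather than the exact formulas, and the function names are illustrative.

```python
def subpacket_length(n_sch: int, m: int) -> int:
    """L_k = 48 * m * N_SCH computed with one adder and shifts only."""
    if m not in (2, 4, 6):
        raise ValueError("modulation order must be 2, 4 or 6")
    # Select the hardwired shifted operand: N<<1 leads to 3N, N<<3 leads to 9N.
    shifted = n_sch << 3 if m == 6 else n_sch << 1
    partial = n_sch + shifted
    # Programmable shifter: six positions for m = 4, five otherwise.
    return partial << (6 if m == 4 else 5)


def start_location(spid: int, length: int, n: int) -> int:
    """F_k = (SPID * L_k) mod 6N with SPID in {0, 1, 2, 3}, no multiplier."""
    if spid not in (0, 1, 2, 3):
        raise ValueError("SPID must be 0, 1, 2 or 3")
    # SPID * L_k from the two SPID bits: bit 0 adds L_k, bit 1 adds 2 * L_k.
    product = (length if spid & 1 else 0) + ((length << 1) if spid & 2 else 0)
    # The modulo is left to Python's % operator in this software model.
    return product % (6 * n)
```

Both functions agree with the direct products in (23) and (24) over the whole parameter range, which is the property the hardware simplification relies on.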

Furthermore, in order to support the puncturing mode, the output memory locations corresponding to unsent bits must be set to zero. To ease the implementation of the SD architecture, all the output memory locations are set to zero while $L_k$ and $F_k$ are computed. As a consequence, about two clock cycles per sample are required to complete the symbol deselection, namely 6N LLRs are output in 12N clock cycles. The symbol deselection throughput can therefore be estimated as

$$T_{SD} = \frac{6N}{12N} f_{clk} = \frac{f_{clk}}{2} \tag{27}$$

As can be observed, sustaining 225 million LLRs per second would require a clock frequency of 450 MHz. To overcome this problem, we not only partition the input buffer into four memories but also increase the memory parallelism, so that each memory location contains p LLRs. Thus, (27) can be rewritten as (28); by setting p to a conservative value, p=4, the SD architecture processes up to sixteen LLRs simultaneously with $f_{clk}$ = 113 MHz.

$$T_{SD} = \frac{6N}{12N/p} f_{clk} = \frac{p \cdot f_{clk}}{2} \tag{28}$$
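The clock figures quoted above follow directly from (27) and (28). The short sketch below reproduces the arithmetic; the function names are illustrative and the model ignores any overhead not captured by these two equations.

```python
def sd_throughput(f_clk_hz: float, p: int = 1) -> float:
    """Symbol deselection throughput in LLRs per second: T_SD = p * f_clk / 2."""
    return p * f_clk_hz / 2


def required_sd_clock(target_llrs_per_s: float, p: int = 1) -> float:
    """Clock frequency (Hz) needed to sustain the target throughput."""
    return 2 * target_llrs_per_s / p
```

For a target of 225 million LLRs per second this gives 450 MHz with p=1 and 112.5 MHz with p=4, which rounds up to the 113 MHz mentioned in the text.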

3.2. Subblock deinterleaver

The received LLRs belong to six possible subblocks depending on the coded bits they refer to ($A$, $B$, $Y_1$, $W_1$, $Y_2$, $W_2$), and each subblock is made of N LLRs. The subblock deinterleaver treats each subblock separately and scrambles its LLRs according to Algorithm 1, given below, where m and J are constants specified by the WIMAX standard and $BRO_m(y)$ is the bit-reversed m-bit value of y.

Algorithm 1
 1: k ← 0
 2: i ← 0
 3: while i < N do
 4:     T_k ← 2^m (k mod J) + BRO_m(⌊k/J⌋)
 5:     if T_k < N then
 6:         i ← i + 1
 7:     else
 8:         discard T_k
 9:     end if
10:     k ← k + 1
11: end while

As a consequence, the number of tentative addresses generated, $N_M$, can be greater than N. Exhaustive simulations, performed on the possible values of N specified by the standard, show that the worst case is $N_M$ = 191, which occurs with N=144. Since 191/144=1.326, a conservative approximation is $N_M$ = 4N/3. The whole subblock deinterleaver architecture is obtained with a single address generator implementing Algorithm 1 to simultaneously write one LLR from each of the six subblock memories. In particular, as imposed by the WIMAX standard, the interleaved LLRs belonging to the A and B subblocks are stored separately, whereas the interleaved LLRs belonging to $Y_1$ and $Y_2$ are stored as a symbol-by-symbol multiplexed sequence, creating a “macro-subblock” made of 2N LLRs. Similarly, a macro-subblock made of 2N LLRs is generated by storing a symbol-by-symbol multiplexed sequence of the interleaved $W_1$ and $W_2$ subblocks.
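Algorithm 1 and the count of tentative addresses can be reproduced with a few lines of Python. This is a software sketch only; the values m=6 and J=3 used for N=144 in the usage note are an assumption (they are not given in this text and should be checked against the standard), although they do reproduce the worst case quoted above.

```python
from typing import List, Tuple


def bit_reverse(value: int, width: int) -> int:
    """Return the width-bit value whose bits are those of value in reverse order."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def subblock_addresses(n: int, m: int, j: int) -> Tuple[List[int], int]:
    """Run Algorithm 1 and return (valid addresses, tentative address count).

    Assumes 2**m * j >= n, otherwise the loop cannot collect n valid addresses.
    """
    addresses: List[int] = []
    k = 0
    while len(addresses) < n:
        t_k = (1 << m) * (k % j) + bit_reverse(k // j, m)
        if t_k < n:
            addresses.append(t_k)  # valid address, counted by i in Algorithm 1
        k += 1                     # every tentative address costs one step
    return addresses, k
```

Calling `subblock_addresses(144, 6, 3)` under the stated assumption yields 144 distinct addresses after 191 tentative ones.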

Since all the subblocks can be processed simultaneously, this architecture deinterleaves six LLRs per clock cycle. As a consequence, the subblock deinterleaver sustains a throughput of

$$T_{SubDein} = \frac{6N}{4N/3} f_{clk} = 4.5 f_{clk} \tag{29}$$

Thus, a throughput of 225 million LLRs per second is sustained with $f_{clk}$ = 50 MHz.

To implement lines 4 and 5 of Algorithm 1, three steps are required: the calculation of k mod J and ⌊k/J⌋, the calculation of $2^m (k \bmod J)$ and $BRO_m(\lfloor k/J \rfloor)$, and the generation of $T_k$ while checking $T_k < N$. It is worth pointing out that k mod J can be efficiently implemented as an up-counter followed by a mod J block. Moreover, each time the mod J block detects that the counter has reached J, a second counter is incremented: the value held by the second counter is ⌊k/J⌋. Since $m \in [3, 10]$, the $2^m (k \bmod J)$ term is implemented as a programmable shifter in the range [0, 7] followed by a hardwired three-position left shifter. The $BRO_m(\lfloor k/J \rfloor)$ term is obtained by multiplexing eight hardwired bit reversal networks. Finally, a tentative $T_k$ address is obtained with an adder and is validated by a comparator. The address generation architecture is shown in Fig. 10 (b).
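The counter-based datapath described above can be modelled in software to check that it produces the same tentative addresses as the direct formula. The sketch below is a behavioural model under that reading of the text, not a description of the actual circuit; names are illustrative.

```python
from typing import Iterator


def tentative_addresses(m: int, j: int, count: int) -> Iterator[int]:
    """Yield the first `count` tentative addresses T_k using two counters."""
    if not 3 <= m <= 10:
        raise ValueError("m is expected in [3, 10]")
    remainder = 0  # first counter followed by the mod J block: k mod J
    quotient = 0   # second counter: floor(k / J)
    for _ in range(count):
        # Programmable shift in [0, 7] followed by a hardwired shift of three.
        scaled = (remainder << (m - 3)) << 3
        # Bit reversal over m bits (a multiplexed hardwired network in hardware).
        reversed_q = int(format(quotient, f"0{m}b")[::-1], 2)
        yield scaled + reversed_q
        remainder += 1
        if remainder == j:  # wrap detected: increment the second counter
            remainder = 0
            quotient += 1
```

The comparator that validates each address is left to the caller, for example by keeping only the values smaller than N.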

3.3. CTC decoder

As detailed in section 2.3, a parallel decoder architecture is required to sustain the throughput imposed by the WIMAX standard. To that purpose we set SP=1, I=8 and $f_{clk}$ = 200 MHz; then, from (17), we analyze the throughput as a function of N for W=32. As shown in Fig. 11, only P=4 achieves the target throughput (horizontal solid line) for N≥480.

Moreover, the window width impacts both the decoder throughput and the depth of the SISO local buffers, so a proper value of W must be selected for each frame size. In particular, if $N/(P \cdot W) \in \mathbb{N}$, the synchronization of the SISOs is simplified. However, the choice of P should minimize collisions in memory access.

Exhaustive simulations show that collisions occur for P=2 and P=4 only with N=108. As a consequence, we select P as a function of N to simultaneously obtain a monotonically increasing throughput as a function of N and to avoid collisions. It is worth pointing out that, when collisions are avoided, the resulting parallel interleaver is a circular shifting interleaver: the address generation is simplified with all SISOs simultaneously accessing the same location of different memories.

Denoting by $idx_0^t$ the memory accessed by SISO-0 at time t during a scrambled half iteration, the memory concurrently accessed by SISO-k is $idx_k^t = (idx_0^t \pm k) \bmod P$.

Thus, the parallel CTC interleaver-deinterleaver system is obtained as a cascaded two-stage architecture (see Fig. 12). The first stage efficiently implements the WIMAX interleaver algorithm, whereas the second one extracts the common memory address $adx^t$ and the memory identifiers $idx_k^t$ from the scrambled address i.

The CTC interleaver algorithm specified in the WIMAX standard is structured in two steps. The first step switches the LLRs referred to A and B that are stored at odd addresses. The second step provides the interleaved address i of the j-th couple as

$$i = (P_0 \cdot j + P'_j) \bmod N, \qquad j = 0, \dots, N-1 \tag{30}$$

where $P_0$ and $P'_j$ are constants that depend only on N and are specified by the standard. It is worth pointing out that the two steps can be swapped; as a consequence, the first step can be performed on-the-fly, avoiding the use of an intermediate buffer to store switched LLRs. A simple architecture to implement (30) can be derived by rewriting (30) as (31).

A small Look-Up-Table (LUT) is employed to store the $P_0 \bmod N$ and $P'_j \bmod N$ terms; then (31) is implemented in two parts, as depicted in Fig. 12. The first part accumulates $P_0$ to implement the $P_0 \cdot j$ term, and the mod N block produces the correct modulo N result. The second part employs the two least significant bits of a counter (j-cnt) to select the proper $P'_j \bmod N$ value, which is added to the $(P_0 \cdot j) \bmod N$ term. A further modulo N operation is performed at the output. Since in this architecture both the first and the second part work on data belonging to [0, 2N−1], all the mod N operations are implemented by means of a subtracter and a multiplexer.
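The accumulate-and-reduce scheme described above can be modelled as follows. This is a behavioural sketch with illustrative names; the four-entry `p_prime` list mirrors the selection by the two least significant bits of the counter, and any parameter values used to exercise it are placeholders rather than values taken from the standard.

```python
from typing import List, Sequence


def mod_n_step(value: int, n: int) -> int:
    """Reduce a value in [0, 2N-1] modulo N with one subtracter and a multiplexer."""
    difference = value - n
    return value if difference < 0 else difference


def interleaved_addresses(n: int, p0: int, p_prime: Sequence[int]) -> List[int]:
    """Generate i = (P0*j + P'_j) mod N for j = 0..N-1 without multiplications.

    p_prime holds the four P'_j values selected by the two LSBs of j.
    """
    p0_mod = p0 % n                         # LUT content: P0 mod N
    p_prime_mod = [p % n for p in p_prime]  # LUT content: P'_j mod N
    accumulator = 0                         # holds (P0*j) mod N
    addresses = []
    for j in range(n):
        selected = p_prime_mod[j & 3]       # two LSBs of the j counter
        addresses.append(mod_n_step(accumulator + selected, n))
        accumulator = mod_n_step(accumulator + p0_mod, n)
    return addresses
```

Because both operands of every addition are already reduced modulo N, each sum stays below 2N and a single conditional subtraction is enough.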

The second stage of the parallel CTC interleaver-deinterleaver architecture works as follows.

Since $adx^t \in [0, N/P-1]$, it can be obtained from the scrambled address i produced by the first stage as expressed by (33).

The straightforward implementation of (33) needs to calculate N/P and to allocate P−2 multipliers, P−1 subtracters, a P-way multiplexer and a small amount of logic for selecting the proper $adx^t$ value. The N/P division can be simplified by restricting the possible P values to powers of two. Thus, we obtain a CTC decoder architecture that exploits throughput/parallelism scalability to avoid collisions, namely we employ P=1 when N≤180, P=2 when 192≤N≤240 and P=4 when 480≤N≤2400. Moreover, as can be inferred from Fig. 12, multiplications are avoided by resorting to simple shift operations ($x \gg i = x/2^i$). The sign of the subtractions (dashed lines in Fig. 12) allows not only the selection of the proper $adx^t$ but also the identification of $idx_0^t$. Then, with P−1 modulo P adders, the other $idx_k^t$ values are straightforwardly generated. As can be observed, choosing P as a power of two reduces the modulo P adders to simpler binary adders. The actual throughput sustained by the described throughput/parallelism scalable architecture is represented by the bold line in Fig. 11.
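The second stage can be sketched in software as follows. The model assumes that P is a power of two dividing N and that the positive sign is taken in the relation between $idx_k^t$ and $idx_0^t$; both function names are illustrative, and the products k·(N/P) are written as multiplications only for brevity.

```python
from typing import List, Tuple


def split_address(i: int, n: int, p: int) -> Tuple[int, int]:
    """Split scrambled address i into (adx, idx0) for P a power of two dividing N."""
    if p < 1 or p & (p - 1):
        raise ValueError("P must be a power of two")
    words_per_memory = n >> (p.bit_length() - 1)  # N / P as a right shift
    # P-1 subtractions i - k*(N/P); their signs tell which memory holds i.
    differences = [i - k * words_per_memory for k in range(1, p)]
    idx0 = sum(1 for d in differences if d >= 0)
    adx = i if idx0 == 0 else differences[idx0 - 1]
    return adx, idx0


def memory_ids(idx0: int, p: int) -> List[int]:
    """idx_k = (idx0 + k) mod P; with P a power of two the modulo is a bit mask."""
    return [(idx0 + k) & (p - 1) for k in range(p)]
```

For example, with N=480 and P=4 the address i=250 maps to location 10 of memory 2, and the four SISOs then access memories 2, 3, 0 and 1 at that same location.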

The global architecture of the designed parallel SISO is given in Fig. 13, where each SISO contains the processors devoted to computing the different metrics required by the BCJR algorithm, as detailed in section 2.3. A simple network is used to properly connect the SISOs according to the current value of P by setting the signal last_SISO. Furthermore, one address crossbar-switch (radx-switch) is used to implement the reading operation, a LIFO stores the addresses and makes them available for the writing phase, and two data crossbar-switches (rdata-switch/wdata-switch) are used to properly send (receive) the data to (from) the memory (EI-MEM) according to the parallel interleaver $idx_k^t$ values.

Table 2 reports the complexity of all the blocks for a 130 nm standard cell technology. The bit-widths are: 6 bits for λ[c;I], 8 bits for λ[u;I], and 12 bits for the state metrics. For further details the reader can refer to (Martina et al., 2009).

| Architecture | SD | Subblock Deinterl. | SISOx1 | Parallel Interl. |
|---|---|---|---|---|
| Logic [kgate] | 11 | 1.7 | 37 | 2.8 |
| Memory [kbit] | 0 | 0 | 14.2 | 59 |

Table 2. Complexity of the whole receiver.

Acknowledgments

This work is partially supported by the WIMAGIC project funded by the European Community.

Maurizio Martina and Guido Masera (December 1st 2009). VLSI Architectures for WIMAX Channel Decoders, WIMAX New Developments, Upena D Dalal and Y P Kosta, IntechOpen, DOI: 10.5772/8282. Available from:
