Fast Fourier Transform Processors: Implementing FFT and IFFT Cores for OFDM Communication Systems

The terms Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) are used to denote efficient and fast algorithms to compute the Discrete Fourier Transform (DFT) and the Inverse Discrete Fourier Transform (IDFT) respectively. The FFT/IFFT is widely used in many digital signal processing applications and the efficient implementation of the FFT/IFFT is a topic of continuous research.


Introduction
In recent years, communication systems based on Orthogonal Frequency Division Multiplexing (OFDM) have been an important driver for research into FFT/IFFT algorithms and their implementation. OFDM is a bandwidth-efficient multiple access scheme for digital communications (Engels, 2002; Nee & Prasad, 2000). Many of today's most important wireless communication systems use OFDM: Digital Audio Broadcasting (DAB) (World DAB Forum, n.d.), Digital Video Broadcasting (DVB) (ETS, 2004), Wireless Local Area Network (WLAN) (IEE, 1999), Wireless Metropolitan Area Network (WMAN) (IEE, 2003) and Multi-Band OFDM Ultra Wide Band (MB-OFDM UWB) (ECM, 2005). The technique is also employed in important wired applications such as Asymmetric Digital Subscriber Line (ADSL) and Power Line Communication (PLC).
OFDM systems rely on the IFFT for an efficient implementation of the signal modulation on the transmitter side, whereas the FFT is used for efficient demodulation of the received signal. The FFT/IFFT is therefore one of the most critical modules in OFDM transceivers. In fact, the most computationally intensive parts of an OFDM system are the IFFT in the transmitter and the Viterbi decoder in the receiver (Maharatna et al., 2004). The FFT is the second most computationally intensive part of the receiver. Therefore, the implementation of the FFT and IFFT must be optimized to achieve the required throughput with the minimum penalty in area and power consumption. The demanding requirements of modern OFDM transceivers lead, in many cases, to the implementation of special-purpose hardware for the most critical parts of the transceiver. Thus, it is common to find the FFT/IFFT implemented as a Very Large Scale Integrated (VLSI) circuit. The techniques applied to the FFT can be applied to the IFFT as well. Moreover, the IFFT can be easily obtained by manipulating the output of an FFT processor. Therefore, the discussion in this chapter concentrates on the FFT without loss of generality.
Different kinds of FFT algorithms can be found in the literature; e.g.: (Good, 1958; Thomas, 1963), (Cooley & Tukey, 1965), (Rader, 1968b), (Rader, 1968a), (Bruun, 1978) and (Winograd, 1978). Among them, the algorithms based on the approach proposed by James W. Cooley and John W. Tukey in (Cooley & Tukey, 1965) are very popular in OFDM systems. These Cooley-Tukey (CT) algorithms present a very regular structure, which facilitates an efficient implementation. The computation of the FFT is divided into log_r(N) stages, where N is the number of points of the FFT and r is called the radix of the algorithm. Within each stage, data shuffling, the so-called butterfly computation and twiddle factor multiplications are performed.
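The stage structure described above, and the remark that an IFFT can be obtained from an FFT processor, can be illustrated with a small software model. The following is only a sketch under stated assumptions: it uses radix 2, a recursive decimation-in-frequency formulation, and the conjugation identity IFFT(X) = conj(FFT(conj(X)))/N as one common way of reusing an FFT. The function names are hypothetical and the code does not model any hardware architecture.

```python
import cmath


def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    half = n // 2
    # Butterfly: sum and difference of samples that are n/2 apart.
    sums = [x[i] + x[i + half] for i in range(half)]
    # Twiddle factor multiplication on the difference branch.
    diffs = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
             for i in range(half)]
    out = [0j] * n
    out[0::2] = fft_dif_radix2(sums)    # even-indexed outputs
    out[1::2] = fft_dif_radix2(diffs)   # odd-indexed outputs
    return out


def ifft_via_fft(spectrum):
    """IFFT computed with the forward FFT: conj(FFT(conj(X))) / N."""
    n = len(spectrum)
    y = fft_dif_radix2([complex(v).conjugate() for v in spectrum])
    return [v.conjugate() / n for v in y]


def dft_direct(x):
    """O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, -1, -2, 5]
    spectrum = fft_dif_radix2(data)
    # Both printed errors should be at floating-point rounding level.
    print(max(abs(a - b) for a, b in zip(spectrum, dft_direct(data))))
    print(max(abs(a - b) for a, b in zip(ifft_via_fft(spectrum), data)))
```

Each recursion level corresponds to one of the log_2(N) stages: a butterfly, a twiddle factor multiplication and a reordering of the data. The interleaving of even and odd outputs plays the role of the data shuffling in this software model.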
Usually, the butterfly operations are all identical and the twiddle factor multiplications and data shuffling follow some kind of pattern. This regular structure makes them very attractive for VLSI circuit implementation. Different hardware architectures have been used in the literature for the implementation of the CT algorithms. The FFT hardware architectures can be classified into three groups:
• Monoprocessor: A single hardware element is used to perform all the butterflies, twiddle factor multiplications and data shuffling of each stage. The same hardware is reused for all the stages.
• Parallel: The computation of the butterflies, twiddle factor multiplications and data shuffling within one stage is accelerated by using several processing elements. The same hardware elements are again reused for all the stages.
• Pipeline: A single hardware element is used to perform all the butterflies, twiddle factor multiplications and data shuffling of each stage. However, in contrast to the former categories, a different hardware element is used to process each stage.
It is common in the literature to further classify the pipeline architectures according to the structure used for the shuffling into two basic types: Delay Commutator (DC) and Delay Feedback (DF). Also, according to the number of lines of data used in these pipeline architectures, they can be classified into Single-path (S) or Multiple-path (M) architectures.
Many different variations of CT algorithms have been proposed in the literature to improve different aspects of the implementation (memory resources, number of arithmetic operations, etc.) and their mapping to a specific hardware architecture. Table 1 summarizes the features of some FFT/IFFT processors for OFDM systems proposed in the literature. Proposals such as (Chang & Park, 2004; Serrá et al., 2004) employed a monoprocessor architecture to process the FFT. (Jiang et al., 2004; Lin, Liu & Lee, 2004) used parallel architectures and (Kuo et al., 2003) chose a cached memory monoprocessor architecture for the FFT processing in an OFDM system. However, the most widely used architectures for the FFT/IFFT processor in an OFDM system are pipeline architectures. (Cortés et al., 2007; He & Torkelson, 1998; Lee & Park, 2007; Lee et al., 2006; Turrillas et al., 2010; Wang et al., 2005) propose Single-Path Delay Feedback (SDF) architectures. (Lin et al., 2005; Liu et al., 2007) employ a Multi-Path Delay Feedback (MDF) architecture. (Bidet et al., 1995) proposes a Single-Path Delay Commutator (SDC) architecture and (Jung et al., 2005; Saberinia, 2006) chose a Multi-Path Delay Commutator (MDC) architecture. (Saberinia, 2006) refers to the MDC architecture as a Buffered Multi-Path Delay Commutator (BRMDC) because of the buffers used for the input data. Analyzing the radix, r, of the algorithms employed in the literature, it can be observed that:
• For the monoprocessor architectures, radix 2 (Serrá et al., 2004) and radix 4 (Chang & Park, 2004) algorithms have been used.
When the length of the FFT is not very large, the fixed-point format is the most widely used number representation format due to the area-saving with respect to the floating-point representation.
From Table 1, it can be seen that many different algorithms and architectures have been proposed for OFDM systems. The designer must select the most appropriate algorithm and the most efficient architecture for that algorithm, given the specifications of a certain OFDM system. This selection is a difficult task. There is not a clear algorithm/architecture winner. Therefore, the designer should explore different algorithms and architectures in the literature to find the optimal one for the specific OFDM application under development.
The typical way of expressing the FFT algorithms in the literature is by means of summations or flow graph notation. Examples of these representations can be found in (He & Torkelson, 1998; Lee & Park, 2007; Lee et al., 2006; Lin, Lin, Chen & Chang, 2004; Lin, Liu & Lee, 2004; Lin et al., 2005; Liu et al., 2007; Maharatna et al., 2004; Tsai et al., 2006). These representations do not help the designer to understand the algorithm quickly, which makes it difficult to relate the algorithm to its hardware resources. Additionally, a general expression for the algorithm/architecture is sometimes lacking, which makes it harder to adapt and evaluate for a different OFDM system. Therefore, these representations are not practical for design space exploration. What the designer needs for efficient design space exploration is a general expression in terms of the design parameters of the algorithms that is easy to understand and quick to map to hardware resources.
In (Pease, 1968), a matrix notation to express the FFT is proposed. Different FFT algorithms are obtained by combining a reduced set of operators to simplify the implementation of parallel processing in a special-purpose machine. In (Sloate, 1974), H. Sloate used the same approach as (Pease, 1968) to demonstrate how several previously defined FFT algorithms could be derived using the matricial expressions. Additionally, he analyzed some new algorithms and worked out how to relate the matricial expressions to their implementation. However, that notation is not generalized for pipeline architectures. Recently, (Cortés et al., 2009) generalized the above approach, presenting a unified approach for radix r^k pipeline SDC/SDF FFT architectures. Radix r^k pipeline FFT architectures are very efficient architectures that are well suited for OFDM systems.
This chapter reviews the matricial representation of radix r^k pipeline SDF FFT architectures. Thus, a general expression in terms of the FFT design parameters that can be linked easily to hardware implementation resources is presented. This way, the designer of FFT/IFFT processors for OFDM systems is provided with the tools for efficient design space exploration. The design space exploration and the optimal architecture selection procedure are illustrated by means of a case study. The case study analyzes the FFT/IFFT processor in a WLAN IEEE 802.11a transceiver. The high level analysis proposed in (Cortés et al., 2009) is extended to implementation level to select the most efficient FFT/IFFT core in terms of area and power consumption.

A unified matricial approach for radix r^k FFT SDF pipeline architectures
This section presents the matricial representation of radix r^k Decimation In Frequency (DIF) SDF pipeline architectures. First, the DFT matrix factorization procedure that leads to the FFT algorithms is reviewed. This review is used to define the basic types of matrices needed for the FFT. Next, in order to simplify the notation and the later mapping of the matricial representation to hardware resources, some operators are defined. Then, the general expression for radix r^k FFT SDF pipeline architectures is presented and the mapping to hardware resources is illustrated.

Review of the DFT matrix factorization
The DFT can be expressed in matrix form as the product of the N × N DFT matrix, T_N, and the input vector. In order to factorize T_N, (Sloate, 1974) introduces a definition in which r is the radix of the algorithm, p = N/r is a positive integer (p ∈ N*) and P_N^(r) is the stride permutation matrix. This matrix, P_N^(r), is defined by its effect on a vector. In these expressions, I_z is the identity matrix of size z and the symbol ⊗ represents the Kronecker product. Given an m × n matrix A with elements a_ij and a matrix B, the Kronecker product A ⊗ B is the block matrix whose (i, j)-th block is a_ij B. Equation (7) shows how to write a DFT matrix in terms of smaller DFT matrices. This process can be repeated recursively until an expression in terms of T_r is reached. Equation (7) is a matricial representation of the decomposition technique used in the well-known Cooley-Tukey FFT (Cooley & Tukey, 1965).
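To make the idea of writing a DFT matrix in terms of smaller DFT matrices concrete, the sketch below numerically checks one standard textbook form of the Cooley-Tukey factorization, T_N = (T_r ⊗ I_p) D (I_r ⊗ T_p) P with N = r·p. This particular ordering of the factors, the diagonal twiddle matrix D and the indexing convention chosen for the stride permutation P are assumptions made for illustration; the chapter's equation (7) may arrange the factors differently (for instance, in its transposed DIF form). All function names are hypothetical.

```python
import cmath


def dft_matrix(n):
    """N x N DFT matrix with entries W_N^(a*b), W_N = exp(-2*pi*j/N)."""
    return [[cmath.exp(-2j * cmath.pi * a * b / n) for b in range(n)]
            for a in range(n)]


def identity(n):
    return [[1 if a == b else 0 for b in range(n)] for a in range(n)]


def kron(A, B):
    """Kronecker product: block (i, j) of the result is A[i][j] * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]


def stride_permutation(n, r):
    """P with (P x)[n2*p + n1] = x[n1*r + n2], p = n // r (assumed convention)."""
    p = n // r
    P = [[0] * n for _ in range(n)]
    for n2 in range(r):
        for n1 in range(p):
            P[n2 * p + n1][n1 * r + n2] = 1
    return P


def twiddle_matrix(n, r):
    """Diagonal matrix with entry W_N^(n2*k1) at position n2*p + k1."""
    p = n // r
    D = [[0] * n for _ in range(n)]
    for n2 in range(r):
        for k1 in range(p):
            idx = n2 * p + k1
            D[idx][idx] = cmath.exp(-2j * cmath.pi * n2 * k1 / n)
    return D


def cooley_tukey_product(n, r):
    """Multiply out (T_r kron I_p) D (I_r kron T_p) P."""
    p = n // r
    factors = [kron(dft_matrix(r), identity(p)), twiddle_matrix(n, r),
               kron(identity(r), dft_matrix(p)), stride_permutation(n, r)]
    product = factors[0]
    for factor in factors[1:]:
        product = matmul(product, factor)
    return product


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


if __name__ == "__main__":
    # The difference should be at floating-point rounding level.
    print(max_abs_diff(cooley_tukey_product(16, 4), dft_matrix(16)))
```

Applying the same factorization again to the smaller matrix T_p is what yields a multi-stage algorithm expressed entirely in terms of T_r.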

Definition of operators
In order to simplify the notation, nine operators are defined. Two different types of operators can be distinguished: the reordering operators and the arithmetic operators.
• Reordering operators
  - Shuffling Operator S^(a): Let N be divisible by r^a.
• Arithmetic operators
  - Butterfly Operator B: Let N be divisible by r.
  - Second Twiddle Factor Multiplier Operator M2^(a): Let N be divisible by r^(a+2).
  - Third Twiddle Factor Multiplier Operator M3^(a,b): Let N be divisible by r^(a+b) and b ≥ 2.

Matricial representation of radix r^k pipeline FFT architecture
The decomposition procedure given in equation (7) can be applied recursively to devise the r^k SDF pipeline architectures. Let N = r^(k n_k + l) with {k, n_k} ∈ N* and l ∈ {0, 1, ..., k − 1}. The matricial representation of r^k DIF SDF pipeline architectures is given by equation (14), in which p_k = r^(k(n_k − 1)) and the term H^(b,k,i) represents the i-th stage of an r^k FFT algorithm. The form of H^(b,k,i) depends on b: separate expressions are given for b = 1, for b = 2 and for even b > 2. These expressions are written in terms of V^(b,k·i), which is given by (19), and G^(b,k·i,m), which is given by (22).
When N = r^(k n_k + l) with l > 0, (14) means that, after n_k stages with r^k structure, an additional processing stage with the structure of an r^l algorithm is required.
The above matricial representation is general in terms of N, r and k. The Decimation In Time (DIT) version of the architecture can be easily obtained by transposing the expressions.
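The split of N into n_k stages with r^k structure plus an optional final stage with r^l structure can be computed directly from N, r and k. The helper below is a minimal sketch of that bookkeeping; the function name is hypothetical and only the relation N = r^(k n_k + l) with n_k ≥ 1 and 0 ≤ l < k is taken from the text.

```python
def stage_structure(n_points, r, k):
    """Return (n_k, l) such that n_points == r ** (k * n_k + l), 0 <= l < k."""
    if r < 2 or k < 1:
        raise ValueError("r must be >= 2 and k must be >= 1")
    # Integer logarithm: find m with r ** m == n_points.
    m, value = 0, 1
    while value < n_points:
        value *= r
        m += 1
    if value != n_points:
        raise ValueError("N must be a power of r")
    n_k, l = divmod(m, k)
    if n_k == 0:
        raise ValueError("N must be at least r ** k")
    return n_k, l


if __name__ == "__main__":
    for k in (2, 3, 4):
        n_k, l = stage_structure(64, 2, k)
        print(f"N=64, radix 2^{k}: n_k={n_k}, l={l}")
```

For N = 64 and r = 2, this gives three radix 2^2 stages, two radix 2^3 stages, or one radix 2^4 stage followed by a final stage with radix 2^2 structure.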

Mapping to pipeline architectures
The structure of typical stages of an r^k pipeline SDF architecture is depicted in Figure 1. It can be observed in the first row of Figure 1 that the output of one stage is connected to the input of the next stage. Within each stage, there is a sequence of hardware processing elements which can perform the computations defined by the operators presented in Section 2.2, as illustrated in the second and third rows of Figure 1. The hardware required to implement the computations demanded by each operator is discussed in (Cortés et al., 2009). It is important to note that the shuffling operators surrounding the butterfly operators can be merged into the single device with feedback that is typical of the SDF architecture. To illustrate this, the implementation of radix 2^k algorithms is presented in the fourth row of Figure 1.
Each stage H^(k,k,i) consists of k butterfly operators B, with their corresponding shuffling and unshuffling terms S and S^−1. The hardware that implements the arithmetic of the butterfly operators B is the same in all the stages of the FFT processor. The implementation of the butterfly depends on the value of r. The length of each of the delay lines used for the shuffling and unshuffling corresponding to the first butterfly of the first stage is N/r. The length of each of these delay lines is reduced by a factor of r from one butterfly to the next, as shown in the fourth row of Figure 1. The number of delay commutator structures used for the shuffling and unshuffling around a butterfly unit is the same as the number of butterfly units.
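The delay-line lengths described above follow a simple geometric pattern, which the sketch below tabulates. One assumption goes beyond the text: each radix-r SDF butterfly is taken to have r − 1 feedback delay lines, which is the usual arrangement but is not stated here. Under that assumption the total memory comes out as N − 1 words for any r, matching the figure quoted in the hardware resources section. Function names are hypothetical.

```python
def sdf_delay_line_lengths(n_points, r):
    """Delay-line length at each butterfly: N/r, N/r^2, ..., 1."""
    lengths = []
    length = n_points
    while length > 1:
        if length % r:
            raise ValueError("N must be a power of r")
        length //= r
        lengths.append(length)
    return lengths


def sdf_total_memory_words(n_points, r):
    """Total words, assuming r - 1 delay lines per butterfly (assumption)."""
    return (r - 1) * sum(sdf_delay_line_lengths(n_points, r))


if __name__ == "__main__":
    for r in (2, 4, 8):
        print(r, sdf_delay_line_lengths(64, r), sdf_total_memory_words(64, r))
```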
Operators M1, M2 and M3 mainly translate to complex multipliers (C.M.) and Look-Up Tables (LUTs) to store the twiddle factors (T.F.). A stage of the form H^(k,k,i) has one twiddle factor multiplier operator M1. The number of twiddle factors stored in the LUT used to implement M1 is N for the first stage and is reduced by a factor of r^k from one stage to the next. When N = r^(k n_k), no complex multiplier is needed to implement the twiddle factor multiplications given by M1^(k(n_k − 1),k) of the last stage H^(k,k,n_k − 1), because M1^(k(n_k − 1),k) = I_N. The final processing given by terms of the form H^(l,k,n_k) when N = r^(k n_k + l), with l > 0, does not actually have an operator M1; i.e.: M1^(k n_k,k) = I_N.
For k > 1, twiddle factor multiplier operators M2 appear. A stage H^(b,k,i) has b/2 operators M2 when b is even and (b − 1)/2 when b is odd. The number of twiddle factors stored in the LUT used to implement M2 is always r^2.
For k > 2, twiddle factor multiplier operators M3 appear. A stage H^(b,k,i) has b/2 − 1 operators M3 when b is even and (b − 1)/2 when b is odd. The number of twiddle factors stored in the LUT used for the implementation of the first M3 within a stage H^(b,k,i) is r^b, and it is reduced by a factor of r^2 from one M3 to the next within the same stage.
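The counting rules for M1, M2 and M3 can be collected into a small resource estimator. The sketch below simply transcribes those rules; it assumes that the final stage has no M1 operator (either the last r^k stage when l = 0, or the extra r^l stage when l > 0) and reports an M1 LUT size of 0 in that case. The function and dictionary key names are hypothetical.

```python
def log_radix(n_points, r):
    """Integer m with r ** m == n_points."""
    m, value = 0, 1
    while value < n_points:
        value *= r
        m += 1
    if value != n_points:
        raise ValueError("N must be a power of r")
    return m


def m2_count(b):
    # b/2 operators when b is even, (b - 1)/2 when b is odd.
    return b // 2


def m3_lut_sizes(b, r):
    # b/2 - 1 operators when b is even, (b - 1)/2 when b is odd.
    count = b // 2 - 1 if b % 2 == 0 else (b - 1) // 2
    # The first M3 stores r^b twiddles; each later one stores r^2 times fewer.
    return [r ** (b - 2 * j) for j in range(count)]


def stage_twiddle_resources(n_points, r, k):
    """Per-stage LUT sizes for M1, M2 and M3, following the rules in the text."""
    n_k, l = divmod(log_radix(n_points, r), k)
    stages = []
    for i in range(n_k):
        m1_is_identity = (l == 0 and i == n_k - 1)
        stages.append({
            "b": k,
            "m1_lut": 0 if m1_is_identity else n_points // r ** (k * i),
            "m2_luts": [r ** 2] * m2_count(k),
            "m3_luts": m3_lut_sizes(k, r),
        })
    if l > 0:
        # Final stage with r^l structure; it has no M1 operator.
        stages.append({"b": l, "m1_lut": 0,
                       "m2_luts": [r ** 2] * m2_count(l),
                       "m3_luts": m3_lut_sizes(l, r)})
    return stages


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, stage_twiddle_resources(64, 2, k))
```

For N = 64 and r = 2 this reports two non-trivial M1 multipliers for k = 2 and one for k = 3 and k = 4, with M3 LUTs of 8 entries (k = 3) and 16 entries (k = 4).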

Hardware resources of r^k pipeline architectures
In (Cortés et al., 2009), the complexity of the r^k algorithms in terms of area was analyzed. In this section, some of their conclusions are summarized.
For a given N, the designer can vary the parameters r and k to achieve different implementations. These parameters influence the reordering operators and the arithmetic operators.
The overall number of memory words used for the delay lines due to the reordering operators depends on neither r nor k. In an SDF architecture, the amount of memory is N − 1 words. From the general expressions, the designer has to look for those values of r and k that result in an optimum single-path FFT pipeline architecture for a given application. This search can be easily performed thanks to the close link between the proposed matricial representation and implementation. No other notation allows such an exploration within the pipeline architectures.

Case study: WLAN IEEE 802.11a
In this section, a case study is analyzed applying the proposed design space exploration in order to achieve the most efficient algorithm/architecture for an OFDM system. This design space exploration does not only analyze the hardware complexity of the algorithms as in (Cortés et al., 2009). Additionally, an implementation level analysis is carried out also taking into account the power consumption which is very important for wireless devices.
The main objective of this design space exploration is to select the most efficient radix r^k pipeline SDF FFT architecture for the OFDM application. An FFT for the WLAN IEEE 802.11a standard has been selected as the case study. In this case, the length of the FFT/IFFT is 64 points. According to (Cortés et al., 2009), the optimum value of k is near 3. Therefore, the following analysis concentrates on the radix 2^2, radix 2^3 and radix 2^4 algorithms in order to perform the implementation level analysis and to study the silicon area and power consumption results.
The proposed design space exploration can be divided into four steps. First, the OFDM system described by the standard is analyzed to extract the specifications. Once the OFDM specifications are known, the designer focuses on the system and on the FFT/IFFT specifications to determine the parameters needed for the FFT/IFFT design. The next step is the implementation level analysis of different FFT/IFFT algorithms and architectures in order to select the most efficient core for the WLAN system according to a given criterion. This step is composed of two main analyses: the Error Vector Magnitude (EVM) analysis in the transmitter and the Carrier-To-Noise Ratio (CNR) analysis in the receiver. After these analyses, the data bitwidth (dbw) and the twiddle factor bitwidth (tbw) are chosen so as not to degrade the system performance. Finally, the layout of the most efficient FFT/IFFT core for WLAN 802.11a in terms of area and power consumption is shown.
Table 2 shows the parameters specified in the IEEE 802.11a standard for a WLAN system (IEE, 1999). N_c = 64 sub-carriers are used. N_p = 4 sub-carriers are used as pilot tones to make the coherent detection more robust against frequency offsets and phase noise. These pilot tones are always in sub-carriers -21, -7, 7 and 21 and they are modulated in Binary Phase-Shift Keying (BPSK). N_d = 48 tones are employed as data sub-carriers and the rest of the sub-carriers are zero. The length of the guard interval must be longer than the delay spread of the channel. Considering an indoor environment, the necessary guard interval in a WLAN system is 0.8 μs. Therefore, the last N_gi = 16 samples must be copied to the beginning of the OFDM symbol. The maximum data rate is R = 54 Mbps. Additionally, the bandwidth of the system is BW = 20 MHz. The standard also determines that the OFDM symbol period is t_sym = 4 μs.
In IEEE 802.11a, data can be modulated in BPSK, Quadrature Phase-Shift Keying (QPSK), 16-Quadrature Amplitude Modulation (QAM) or 64-QAM.
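The timing and rate figures quoted for IEEE 802.11a are mutually consistent, as the short check below illustrates. Two assumptions are not stated in this chapter: the sample rate is taken to be equal to the 20 MHz bandwidth, and the 54 Mbps mode is taken to use 64-QAM with a rate-3/4 convolutional code, as in the standard. Exact rational arithmetic is used to avoid rounding issues; the variable names are hypothetical.

```python
from fractions import Fraction

N_C = 64                 # FFT length (total sub-carriers)
N_D = 48                 # data sub-carriers
N_P = 4                  # pilot sub-carriers
N_GI = 16                # guard-interval samples
BW_HZ = 20_000_000       # bandwidth, also used as sample rate (assumption)

sample_period = Fraction(1, BW_HZ)
t_fft = N_C * sample_period      # useful symbol duration: 3.2 us
t_gi = N_GI * sample_period      # guard interval: 0.8 us
t_sym = t_fft + t_gi             # OFDM symbol period: 4 us

BITS_PER_SUBCARRIER = 6          # 64-QAM
CODE_RATE = Fraction(3, 4)       # assumption: coding rate of the 54 Mbps mode

data_rate = N_D * BITS_PER_SUBCARRIER * CODE_RATE / t_sym

if __name__ == "__main__":
    print(float(t_gi), float(t_sym), float(data_rate))
```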

Study of the IEEE 802.11a standard
The EVM is defined in terms of the error vector between the measured symbol, I + jQ, and the transmitted symbol, I_o + jQ_o. In the averaged expression, EVM(i, j, k) is the magnitude of the error vector for the k-th sub-carrier of the j-th OFDM symbol of the i-th frame. The EVM is the quality constraint that must not be degraded in the transmitter. The EVM value in Table 2 is the minimum (most stringent) EVM requirement of a WLAN system, which applies when the system transmits data modulated in 64-QAM at the maximum data rate of 54 Mbps.
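As an illustration of how an EVM figure in dB can be obtained from measured and transmitted symbols, the sketch below computes an RMS error-vector magnitude. The normalization is an assumption: the error power is divided by the average power of the transmitted (reference) symbols, which is one common convention. The chapter's own averaged expression over frames, OFDM symbols and sub-carriers may be organized differently. The function name is hypothetical.

```python
import math


def evm_db(measured, reference):
    """RMS EVM in dB, normalized by the reference symbol power (assumption)."""
    error_power = sum(abs(m - r) ** 2 for m, r in zip(measured, reference))
    reference_power = sum(abs(r) ** 2 for r in reference)
    if error_power == 0:
        return float("-inf")   # no error at all
    return 10 * math.log10(error_power / reference_power)


if __name__ == "__main__":
    transmitted = [1 + 1j, -1 - 1j]
    # Same symbols with a small offset of 0.1 on the in-phase component.
    received = [1.1 + 1j, -0.9 - 1j]
    print(evm_db(received, transmitted))
```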
The IEEE 802.11a standard specifies the spectral mask of the output signal in the transmitter. Figure 3 presents the spectral mask of the IEEE 802.11a transmitter output, whose height is H/N_o = 40 dB. The quality constraint selected for the receiver is the non-degradation of the CNR when the input signal to the receiver has the spectral mask specified by the standard. This is a first approach that produces an overconstrained core. The EVM and H/N_o specifications are used to determine the allowed quantization error.

System analysis and FFT/IFFT specifications
The goal of this analysis is to determine the parameters of Table 3, which presents five important system specifications that significantly influence the design of the FFT/IFFT: the length of the FFT/IFFT, N; the system clock frequency, f_clk; the value of K_tx in the transmitter; and the value of K_AGC and the CNR in the receiver. As the OFDM system supports 64 sub-carriers, the FFT core must be designed to perform a 64-point FFT; i.e.: N = 64. As shown in Table 2, the OFDM symbol period (t_sym) in a WLAN system is equal to 4 μs. Following the analysis of (Velez, 2005), f_clk = 60 MHz is selected, which is higher than the maximum data rate R and a multiple of the BW.
Table 3. System specifications
K_tx: Maximum gain to improve the DAC performance
K_AGC: Maximum gain to improve the FFT performance
CNR_max: Maximum Carrier-to-Noise Ratio

K_tx in the transmitter
In order to select K_tx, an analysis of the IFFT output data must be carried out for each OFDM system. For this analysis, the modulation for which the I and Q data reach the highest values must be used.
In this analysis, K_tx is estimated by means of simulations. The transmission of the number of frames determined by the standard is simulated for different values of K_tx to calculate the EVM. Then, the largest K_tx that does not degrade the EVM can be selected. This way, the whole dynamic range of the Digital-to-Analog Converter (DAC) is exploited. Figure 4 shows the system model used to analyze the EVM degradation as the value of K_tx increases. The system model is composed of a floating-point IFFT in the transmitter and a floating-point FFT in the receiver. The effect of the clipping in the DAC is emulated by a limiter located after the multiplication by K_tx. This limiter is responsible for saturating the amplified IFFT output according to the integer bits of the data representation. Two integer bits are considered for both the input and output data in the IFFT, since the maximum value of the input data is 1.08 when the modulation is 64-QAM. In order to improve the dynamic range of the IFFT output, the data are amplified by a factor K_tx. A factor K_tx that is a power of two is preferred to simplify the hardware implementation. In (Velez, 2005), the overflow probability of the different types of modulations in WLAN is analyzed by carrying out Monte Carlo simulations with 3 ≤ K_tx ≤ 10. It is concluded that BPSK modulation is the most sensitive to the clipping at the DAC. (Velez, 2005) also concluded that the most suitable value for K_tx is K_tx = 4 so as not to degrade the system performance. Additionally, K_tx = 4 is implemented as a simple shift in a real implementation. Figure 5 shows the effect of the factor K_tx on the EVM in the transmitter, that is, how the EVM is degraded by the clipping produced in the IFFT. If a factor K_tx = 8 were selected, the EVM value would not meet the standard specification.
Fig. 4. System model to study the effect of K_tx in EVM
It can be observed that K_tx = 4 is a conservative decision, since the EVM specified by the standard for a BPSK modulation with a data rate of 9 Mbps is -8 dB, and the EVM obtained with K_tx = 4 is -61.95 dB.
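The gain-and-limiter part of the system model can be sketched in a few lines. The code below is an illustrative assumption rather than the simulation actually used: it treats the two integer bits as including the sign, so that the representable range is taken to be [−2, 2], and it clips the real and imaginary parts independently. The function names are hypothetical.

```python
def amplify_and_clip(samples, k_tx, int_bits=2):
    """Multiply by k_tx, then saturate I and Q to the assumed range."""
    limit = 2.0 ** (int_bits - 1)      # assumed range: [-limit, +limit]

    def clip(value):
        return max(-limit, min(limit, value))

    return [complex(clip(k_tx * s.real), clip(k_tx * s.imag)) for s in samples]


def clipped_fraction(samples, k_tx, int_bits=2):
    """Fraction of samples whose I or Q component is changed by the limiter."""
    clipped = amplify_and_clip(samples, k_tx, int_bits)
    changed = sum(1 for s, c in zip(samples, clipped) if c != k_tx * s)
    return changed / len(samples)


if __name__ == "__main__":
    ifft_output = [0.25 + 0.1j, 0.6 - 0.7j, -0.1 + 0.05j, 0.3 + 0.2j]
    for k_tx in (2, 4, 8):
        print(k_tx, clipped_fraction(ifft_output, k_tx))
```

Sweeping k_tx in such a model and measuring the EVM after an ideal receiver is one way to reproduce the kind of trade-off described above: larger gains use more of the DAC range but clip more samples.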

K_AGC and CNR_max in the receiver
An Automatic Gain Control (AGC) module is a block found in many electronic devices. The AGC is commonly used to dynamically adjust the gain of the receiver amplifier to keep the received signal at the desired power level.
In this analysis, the effect of the AGC is modeled as a gain K_AGC. For the analysis, the value of K_AGC that does not degrade the CNR is selected. This value is determined by means of simulations. Before applying the AWGN channel, the power of the output signal of the IFFT, X_i, is normalized to 1 by dividing the signal by a normalization factor λ. In Figure 6, Y is the signal after applying the AWGN channel. At the receiver, this signal is multiplied by the gain K_AGC and a limiter is applied. This limiter is responsible for limiting the amplified signal according to the number of integer bits of the data representation. Once the limiter is applied, Y_q is the input of the floating-point FFT and y_q is its output. y_i is the output of the floating-point FFT without noise and without applying the limiter. For the simulations, it is necessary to determine the variance of the noise to be applied to the transmitted signal so that the signal meets the spectral mask with the height H/N_o given by the standard. The Signal-To-Noise Ratio (SNR) of the transmitter output can be calculated from the height of the spectral mask H/N_o given by the standard. The power of the output signal of the transmitter is written assuming that the power of the data sub-carriers is equal to the power of the pilot sub-carriers. Together with the noise power, this gives the SNR and, from it, the variance σ^2 of the noise in the channel. In order to select the value of the factor K_AGC, the CNR is used as a figure of merit. The CNR is defined as the ratio P_i / P_n, where P_i is the power of y_i and P_n is the power of the difference between y_i and y_q.
The gain K_AGC is selected so that the CNR is not degraded. Factor K_AGC fixes the power of the signal Y_q; i.e.: P_Yq. The figures used to analyze the effect of K_AGC on the CNR plot the CNR versus P_Yq, which is a more representative value.
The CNR is obtained by means of Monte Carlo simulations. This method allows the designer to estimate a measure of the performance of a communication system and the quality of the estimation itself (Sevillano, 2004). The looseness, γ, of the estimation is defined for a confidence level of 0.95, so that the probability of the CNR belonging to the interval [(1 − γ)ĈNR, (1 + γ)ĈNR] is 0.95, where ĈNR is the estimated CNR. The Monte Carlo simulations are stopped when γ < 10^−2.
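A stopping rule of this kind can be sketched as follows. The expression used for the looseness is an assumption, since the chapter's own definition is not reproduced here: γ is taken as the half-width of the normal-approximation 95% confidence interval of the sample mean, relative to the mean, i.e. γ = 1.96·s/(√n·|mean|). The function name, the `sample_fn` callback and the trial limits are hypothetical.

```python
import math
import random


def monte_carlo_mean(sample_fn, gamma_target=1e-2, z=1.96,
                     min_trials=100, max_trials=1_000_000):
    """Average sample_fn() until the relative half-width gamma < gamma_target."""
    n, mean, m2 = 0, 0.0, 0.0
    while n < max_trials:
        x = sample_fn()
        n += 1
        # Welford's running update of the mean and the sum of squared deviations.
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_trials and mean != 0:
            std = math.sqrt(m2 / (n - 1))
            gamma = z * std / (math.sqrt(n) * abs(mean))   # assumed definition
            if gamma < gamma_target:
                return mean, gamma, n
    raise RuntimeError("looseness target not reached within max_trials")


if __name__ == "__main__":
    rng = random.Random(1)
    # Stand-in for one simulated CNR measurement (hypothetical distribution).
    estimate, gamma, trials = monte_carlo_mean(lambda: rng.gauss(10.0, 1.0))
    print(estimate, gamma, trials)
```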
It is assumed that the signal arrives with the maximum quality H/N_o = 40 dB. Therefore, applying (28), the SNR of the transmitted signal is 39.09 dB. Figure 7 shows the (CNR)_dB versus the power of the input signal of the FFT, P_Yq. The system works without degradation with P_Yq < −4 dB, which corresponds to K_AGC = 6 in the simulations. For this value of K_AGC, (CNR_max)_dB is 39.1 dB.
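One reading of the SNR calculation that reproduces the quoted 39.09 dB is sketched below. It assumes that the signal power is spread equally over the N_d + N_p = 52 used sub-carriers, each at the mask height H, while the noise at density N_o occupies all N_c = 64 sub-carriers, so that SNR = (52/64)·H/N_o. This is an inference from the numbers, not the chapter's equation (28) itself. The noise variance then follows for a transmit signal normalized to unit power. Function names are hypothetical.

```python
import math

N_C, N_D, N_P = 64, 48, 4


def snr_db_from_mask(h_over_no_db, n_used=N_D + N_P, n_total=N_C):
    """SNR in dB, assuming signal on n_used and noise on n_total sub-carriers."""
    return h_over_no_db + 10 * math.log10(n_used / n_total)


def noise_variance(snr_db, signal_power=1.0):
    """Noise variance for a signal of the given power and SNR."""
    return signal_power / 10 ** (snr_db / 10)


if __name__ == "__main__":
    snr = snr_db_from_mask(40.0)
    print(snr, noise_variance(snr))
```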

Summary of results
In this step, the length of the FFT, N, the system clock frequency, f_clk, the gain factor in the transmitter, K_tx, the gain factor in the AGC, K_AGC, and the CNR_max have been estimated. Table 4 presents the necessary OFDM parameters for the next step.

Selection of algorithm/architectures
A high throughput FFT core is needed to fulfil the required specifications. Pipeline architectures are well suited to achieve small silicon area, high throughput, short processing time and reduced power consumption. The pipeline-SDF radix 2^2 DIF architecture is composed of six butterflies and two complex multipliers. The pipeline-SDF radix 2^3 DIF architecture is also composed of six butterflies, but it uses one complex multiplier and two constant multipliers, each of which multiplies by a single constant. The pipeline-SDF radix 2^4 DIF architecture is formed by six butterflies, one complex multiplier and one constant multiplier that multiplies by two constants. Therefore, the three algorithms have the same number of multipliers, taking into account both normal and constant multipliers. The radix 2^3 and 2^4 DIF algorithms reduce the ROM with respect to the radix 2^2 DIF one, but they add control logic.
To sum up, at this point it is not clear which is the most efficient architecture for WLAN. Therefore, it is necessary to make an implementation level analysis. The hardware complexity of the radix 2^2, 2^3 and 2^4 DIF algorithms is compared to search for the optimum design that meets the system specifications. The EVM constrains the word-length of the IFFT in the transmitter. The word-length in the receiver is constrained by the CNR. If the transmitter and the receiver are implemented in the same chip, the highest word-lengths must be chosen. The figures of merit to select the algorithm and architecture are the area and the power consumption estimated for an Application-Specific Integrated Circuit (ASIC) technology. Thus, this selection process can be stated as the problem of finding the FFT/IFFT processor which minimizes the AP cost function subject to the constraints given by the specifications. The AP criterion trades off area and power consumption and can be used as a measure of the efficiency of the core. The area and power results presented in the following sections have been calculated for a TSMC 90 nm 6 ML technology with a clock frequency of 60 MHz, working at 1.0 V and a temperature of 25 °C. The area results have been estimated by multiplying the cell area by a factor of 2.

EVM analysis
In order to analyze the effect of the IFFT quantization error on the EVM during transmission, an ideal reception is considered. The system model is composed of a fixed-point IFFT at the transmitter and a floating-point FFT at the receiver, as can be seen in Figure 8(a). The EVM value is employed to select the values of dbw and tbw needed in the transmitter. An EVM margin of -10 dB is used in order to select a conservative word-length. Thus, a margin is left for other sources of error, such as the error produced by the analog processing. The pipeline-SDF architectures with the radix 2^2, 2^3 and 2^4 DIF algorithms have been studied to find the most efficient implementation. For the radix 2^3 and 2^4 algorithms, the bitwidth of the constant multiplier is assumed to be equal to the bitwidth of the normal multiplier. Figure 9 shows the EVM and the area results of the pipeline-SDF DIF 64-point FFT/IFFT core for the different algorithms. In order to guarantee an EVM of -35 dB or better, the radix 2^2 algorithm requires a (dbw, tbw) of (12, 8), whereas the radix 2^4 and 2^3 algorithms need (12, 7). For the given EVM, the radix 2^4 algorithm achieves the smallest core with (dbw, tbw) = (12, 7).
The EVM and the power results of the pipeline-SDF DIF 64-point FFT/IFFT core are presented in Figure 10. It can be observed that the radix 2^2 algorithm consumes less power for the same (dbw, tbw) than the rest of the algorithms, whereas the radix 2^3 algorithm consumes more than the others.
The power consumption is normalized by slightly modifying the expression given by (Kuo et al., 2003), in which P is the power consumption with a supply voltage of V_DD and working at f_clk. The r2^2 algorithm needs a larger bitwidth to achieve the target EVM. This extra bit in tbw increases the area needed. The r2^3 and r2^4 algorithms can achieve the required EVM with a smaller tbw. The r2^4 algorithm with dbw = 12 and tbw = 7 achieves the most area-efficient implementation that fulfills the EVM specification. Nevertheless, the power consumption of the r2^4 algorithm is higher than the power consumption of the r2^2 algorithm. In fact, the r2^2 algorithm is the most power-efficient design. In order to achieve a trade-off, the parameter AP = Area · P_norm is employed, since it takes into account both the area and the power consumption and thus measures the efficiency of the design. For the WLAN transmitter, it can be observed in Table 6 that the most efficient cores are the ones using the r2^2 and r2^4 algorithms.

CNR analysis
After the EVM analysis, the r2^3 algorithm is discarded. Therefore, the CNR analysis focuses on the r2^2 and r2^4 algorithms. At this point, the area and power results of the FFT/IFFT core for different bitwidth configurations are already known. The CNR analysis is then used to select the dbw and tbw that fulfill the CNR specification.
In order to analyze the effect of the FFT quantization, the simulation model is formed by a floating-point IFFT at the transmitter and a fixed-point FFT at the receiver, as shown in Figure 8(b). First, only the data are quantized. In this case, a plot of the CNR versus dbw is used to select the dbw that does not degrade the CNR. Once dbw is selected, the twiddle factors are also quantized, and a plot of the CNR versus tbw is used to choose the tbw that does not degrade the CNR. This analysis is carried out for each candidate algorithm in order to determine the (dbw, tbw) needed to comply with the CNR specification.
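The two-step search described above can be sketched as follows. This is a minimal illustration, not the chapter's tool: `select_bitwidths`, `smallest_bitwidth` and the tolerance `tol_db` are hypothetical names, and `toy_cnr_db` is only a stand-in for the fixed-point simulation of Figure 8(b), with an arbitrary 40 dB channel-limited CNR and an assumed additive quantization-noise model. In practice, `cnr_db` would wrap the real simulation.

```python
import math

def smallest_bitwidth(cnr_of, candidates, cnr_ref_db, tol_db=0.1):
    """Return the smallest bitwidth whose CNR is within tol_db of the reference."""
    for bw in sorted(candidates):
        if cnr_of(bw) >= cnr_ref_db - tol_db:
            return bw
    raise ValueError("no candidate bitwidth meets the CNR target")

def select_bitwidths(cnr_db, dbw_range, tbw_range, tol_db=0.1):
    """Two-step search: choose dbw with ideal twiddles, then tbw for that dbw.

    cnr_db(dbw, tbw) returns the CNR in dB; None means 'not quantized'.
    """
    cnr_float = cnr_db(None, None)                      # floating-point reference
    dbw = smallest_bitwidth(lambda d: cnr_db(d, None), dbw_range, cnr_float, tol_db)
    cnr_data_only = cnr_db(dbw, None)                   # reference for step two
    tbw = smallest_bitwidth(lambda t: cnr_db(dbw, t), tbw_range, cnr_data_only, tol_db)
    return dbw, tbw

def toy_cnr_db(dbw, tbw, channel_cnr_db=40.0):
    """Stand-in CNR model (assumption): channel noise plus quantization noise
    that drops by 6 dB per extra bit. NOT the chapter's simulation."""
    noise = 10 ** (-channel_cnr_db / 10)
    if dbw is not None:
        noise += 4.0 ** (-(dbw - 2))
    if tbw is not None:
        noise += 4.0 ** (-(tbw - 1))
    return -10 * math.log10(noise)

if __name__ == "__main__":
    print(select_bitwidths(toy_cnr_db, range(8, 21), range(6, 17)))
```

Quantizing the data first and the twiddle factors second keeps each sweep one-dimensional, which is why two curves such as CNR versus dbw and CNR versus tbw are sufficient.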
As an example, plots of (CNR)_dB versus dbw and tbw are given for the two algorithms. Figure 11(a) shows the (CNR)_dB obtained versus the dbw parameter. In this case, the twiddle factors are not quantized. From the figure, a data bitwidth of dbw = 15 is selected to avoid degrading the (CNR)_dB for both the radix 2^2 and radix 2^4 algorithms. Once the data bitwidth is chosen, the twiddle factors are quantized. Figure 11(b) presents the (CNR)_dB versus tbw for dbw = 15. In this case, a twiddle factor bitwidth of tbw = 10 is selected for both the radix 2^2 and radix 2^4 algorithms. It can be observed that increasing tbw further does not improve the performance of the core.
Table 7 summarizes the (dbw, tbw) needed by the FFT to comply with the CNR requirement. Comparing the bitwidths needed by the IFFT in the transmitter to comply with the EVM with those needed by the FFT in the receiver to comply with the CNR, the CNR is clearly the more restrictive specification. Therefore, the (dbw, tbw) selected for the FFT are the bitwidths used in the FFT/IFFT core. Table 7 also presents the AP results of the FFT algorithms with the bitwidths (dbw, tbw) needed to comply with the specifications. Taking the AP into account, the most efficient core for a WLAN system is the pipeline-SDF radix 2^2 DIF architecture.
To sum up, the parameters of the final implementation of the pipeline-SDF radix 2^2 DIF architecture are shown in Table 8. As mentioned before, the system clock frequency is 60 MHz. Working at this frequency, the processing time of the chosen core is 0.67 μs. The data format is 2.13, whereas the format of the twiddle factors is 1.9. The EVM is -51.92 dB and the (CNR)_dB is 38.57 dB. The estimated silicon area of the pipeline-SDF radix 2^2 DIF FFT/IFFT core is 0.1284 mm^2 and the estimated normalized power consumption P_norm is 0.0277. It can be observed that the EVM complies with the specification in Table 4, and the CNR is close to CNR_max.
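As a worked illustration of the AP figure of merit for the selected core, and assuming that AP is formed directly from the estimated area and the estimated P_norm quoted above (the value listed in Table 7 may be rounded differently):

```latex
\[
AP = \mathrm{Area} \cdot P_{norm}
   = 0.1284\ \mathrm{mm^2} \times 0.0277
   \approx 3.56 \times 10^{-3}
\]
```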
N    dbw   tbw   f_clk    t_proc    CNR        EVM
64   15    10    60 MHz   0.67 μs   38.57 dB   -51.92 dB
Table 8. Core parameters for the pipeline-SDF radix 2^2 DIF architecture

Layout of the FFT/IFFT core in an ASIC
Figure 12 shows the layout of the 64 complex-point FFT/IFFT core fabricated in a TSMC 90 nm, 6-ML CMOS process. The core size is 0.1362 mm^2. In the previous section, the core area was estimated to be 0.1284 mm^2, so the estimate is within about 6% of the layout area. In order to present a comparison with the proposals found in the literature, the area of the cores is normalized as in (B.M. Baas, 1999) using (34), where T_a is the transistor width of the technology actually used, A is the area occupied in technology T_a and T_b is the technology to which the area is normalized.
In order to assess the quality of the FFT/IFFT core, Table 9 compares it with other 64 complex-point FFT/IFFT cores found in the literature. In Table 9, the area (A_norm) and the power (P_norm) have been normalized to a 90 nm technology using (34) and (33). The AP parameter indicates that our core is the most efficient one for a WLAN application. In fact, the presented core requires the smallest normalized power (P_norm).
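The area normalization used for Table 9 can be illustrated as follows. This is a sketch that assumes the usual quadratic dependence of silicon area on feature size; the symbols T_a, T_b and A follow the definitions given in the text, and the exact form of the chapter's equation may differ.

```latex
\[
A_{norm} = A \left( \frac{T_b}{T_a} \right)^{2}
\]
% Presented core: T_a = T_b = 90 nm, so A_norm = A = 0.1362 mm^2.
% Hypothetical core designed in 180 nm: (90/180)^2 = 0.25, so A_norm = 0.25 A.
```

Under this assumption, a core designed in an older technology is credited with the area reduction it would obtain from scaling alone, so that the comparison reflects the architecture rather than the process.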

Conclusions
Many different FFT/IFFT algorithms and architectures have been proposed in the literature for OFDM systems, as presented in Section 1. However, the usual FFT notations do not lend themselves to a general analysis for selecting the FFT/IFFT algorithm and architecture.
In Section 3, a design space exploration among different algorithms has been carried out. This search is hard to perform if general expressions are not available for the different algorithms in a unified form and if a mapping to the implementation cannot be easily established. In this chapter, the matricial notation summarized in Section 2 is used as a tool to help the designer in this search. The OFDM parameters obtained from the analysis of the IEEE 802.11a standard have been employed as constraints for the optimization problem. The AP parameter, which trades off area and power consumption, has been used as a measure of the efficiency of the core. Finally, a pipeline-SDF radix 2^2 DIF FFT/IFFT processor has been proposed, since it achieves the minimum of the AP cost function.
To sum up, it can be concluded that there is no unique FFT/IFFT algorithm, architecture and implementation that is optimal for all OFDM systems. Therefore, it is recommended to perform a search across the algorithm, architecture and implementation dimensions for each OFDM system. The matricial notation is presented in this chapter as a unified and compact representation that can help the designer in this search. This search is feasible, and the FFT/IFFT cores implemented using this approach present a great efficiency.