Open access

Acceleration of Computation Speed for Wavefront Phase Recovery Using Programmable Logic

Written By

Eduardo Magdaleno and Manuel Rodríguez

Submitted: March 23rd, 2011 Published: January 20th, 2012

DOI: 10.5772/30809

Chapter metrics overview

2,294 Chapter Downloads

View Full Metrics

1. Introduction

Atmospheric turbulence introduces optical aberration into wavefronts arriving at ground-based telescopes. Current adaptive optics (AO) systems use vector-matrix-multiply (VMM) reconstructors to convert gradient measurements to wavefront phase estimates. Until recently, the problem of an efficient phase recoverer design has been implemented over PC or GPU platforms. As the number of actuators n increases, the time to compute the reconstruction by means of the VMM method scales as O(n 2 ). The number of actuators involved in AO systems is expected to increase dramatically in the future. For instance, the increase in the field of astronomy is due to increasing telescope diameters and new higher-resolution applications on existing systems. The size increase ranges from hundreds up to tens of thousands of actuators and requires faster methods to complete the AO correction within the specified atmospheric characteristic time. The next generation of extremely large telescopes (with diameters measuring from 50 up to 100 meters) will demand important technological advances to maintain telescope segment alignment (phasing of segmented mirrors) and posterior atmospheric aberrations corrections. Furthermore, an increase in telescope size requires significant computational power. Adaptive optics includes several steps: detection, wavefront phase recovery, information transmission to the actuators and their mechanical movements. A quicker wavefront phase reconstruction appears to be an extremely relevant step in its improvement. For this reason other hardware technologies must be taken into account during the development of a specific processor.

In recent years, programmable logic devices called FPGA (Field Programmable Gate Array) are seriously taken into account like a technological alternative in those fields where fast computations are required. The FPGA technology makes the sensor applications small-sized (portable), flexible, customizable, reconfigurable and reprogrammable with the advantages of good customization, cost-effectiveness, integration, accessibility and expandability. Moreover, an FPGA can accelerate the sensor calculations due to the architecture of this device. In this way, FPGA technology offers extremely high-performance signal processing and conditioning capabilities through parallelism based on slices and arithmetic circuits and highly flexible interconnection possibilities (Meyer-Baese, 2001 and Craven & Athanas 2007). Furthermore, FPGA technology is an alternative to custom ICs (integrated circuits) for implementing logic. Custom integrated circuits (ASICS) are expensive to develop, while generating time-to-market delays because of the prohibitive design time (Deschamps 2006). Thanks to computer-aided design tools, FPGA circuits can be implemented in a relatively short space of time. For these reasons, FPGA technology features are an important consideration in sensor applications nowadays.

In particular, the algorithm of the phase recoverer is based on the estimation of Fast Fourier Transforms (FFT) (Roddier & Roddier, 1991). For this reason, we have to do a careful choice of the architecture of the FFT. An efficient design of this block can result in substantial benefits in speed in comparison with the GPU, DSP or CPU solution.

This chapter presents the design of a fast enough wavefront phase reconstruction algorithm that is based on FPGA technology, paving the way to accomplish the extremely large telescope’s (ELT) number of actuators computational requirements within a 6 ms limit, which is the atmospheric response time. The design was programmed using the VHDL hardware description language and XST was used to synthesize into Virtex-6 and Virtex-7 devices. The FPGA implementation results almost 30 times faster than using GPU technology for a 64x64 phase recoverer. The chapter presents a comparative analysis of used resources for several FPGAs and time analysis.


2. A descriptive approach to the wavefront phase recovery

The Shack-Hartmann wavefront sensor samples the signal Ψ telescope (u,v) (complex amplitude of the electromagnetic field) to obtain the wavefront phase map: Φ(u,v). This is only possible if the sampling is done by microlenses or subpupils with r 0 dimension (that is, inside the phase coherence domain). An array of microlenses is used to sample the wavefront. This array is a rigid piece that fixes the sampling rate. Each (i,j) microlense produces a spot (figure 1):

I i j ( x , y ) = | F F T [ ψ t e l e s c o p e i j ( u , v ) ] | 2 E1

Figure 1.

Diagram of the light travel across a Shack-Hartmann sensor to measure the phase of the wavefront in the telescope pupil

The displacements d ij of the spots centroid with regard to the references centroid (associated with a plane wavefront phase), are a proportional estimation of the average subpupil phase gradient:

d i j = K φ s u b p i j r E2

Where r is the position with components (u,v), and K is a constant. K depends on the wavelength and the focal distances of the telescope, reimaging lens and microlenses. From these gradients estimation, the wavefront phase Φ(u,v) can be recovered using an expansion over complex exponential polynomials (Poyneer et al., 2002):

φ ( u , v ) = p , q = 0 N 1 a p q Z p q ( u , v ) = p , q = 0 N 1 a p q 1 N e 2 π i N ( p u + q v ) = I F F T ( a p q ) E3

The gradient is then written:

S ( u , v ) = φ ( u , v ) = φ u i + φ v j = p , q a p q Z p q E4

Making a least squares fit over the F function:

F = u , v = 1 N [ S ( u , v ) p , q a p q ( Z p q u i + Z p q v j ) ] 2 E5

where S are experimental data, the coefficients, apq, of the complex exponential expansion in a modal Fourier wavefront phase reconstructor (spatial filter), can be written as:

a p q = i p F F T { S x ( u , v ) } + i q F F T { S y ( u , v ) } p 2 + q 2 E6

The phase can then be recovered from the gradient data by transforming backward those coefficients:

φ ( u , v ) = F F T 1 [ a p q ] E7

A filter composed of three Fourier transforms therefore must be calculated to recover the phase. In order to accelerate the process, an exhaustive study of the crucial FFT algorithm was carried out which allowed the FFT to be specifically adapted to the modal wavefront recovery pipeline and the FPGA architecture.


3. A descriptive approach to the FPGA architecture

A FPGA device is essentially a matrix of logic cells (called slices). These slices are connected among themselves and with input/output blocks (IOB) through routing channels. These channels are distributed in the FPGA in horizontal and vertical form and its connexions are fixed using a programmable switch matrix (figure 2).

Figure 2.

FPGA generic architecture

Each slice is formed by look-up tables (LUT) and several flip-flops and programmable multiplexers (figure 3). The number and features of these elements depend on the FPGA family. Slices can implement memory (called distributed memory) or simple logic functions. Complex functions and parallel circuits are implemented when they are associated among themselves. Furthermore, FPGAs contain specific RAM blocks (BRAM) and arithmetic circuits.

Figure 3.

Block diagram of a slice

In particular, FPGAs allow a wide variety of computer arithmetic implementations for the desired digital signal processing algorithms like FFT, because of the physical bit-level programming architecture. FFT algorithm can be parallelized and we can implement a full pipeline architecture using FPGA. This feature contrasts with DSPs and GPUs, with the fixed multiply accumulator core.

In addition, Xilinx Virtex FPGA devices incorporate DSP48 arithmetic modules. Each DSP48 slice has a two-input multiplier followed by multiplexers and a three-input adder/subtracter. The multiplier accepts two 18-bit, two's complement operands producing a 36-bit, two's complement result. The result is sign extended to 48 bits and can optionally be fed to the adder/subtracter. The adder/subtracter accepts three 48-bit, two's complement operands, and produces a 48-bit two's complement result. Two DSP48 slices, a shared 48-bit C bus, and dedicated interconnect form a DSP48 tile (figure 4).

The DSP48 slices support many independent functions, including multiplier, multiplier-accumulator (MAC), multiplier followed by an adder, three-input adder, barrel shifter, wide bus multiplexers, magnitude comparator, or wide counter. The architecture also supports connecting multiple DSP48 slices to form wide math functions, DSP filters, and complex arithmetic without the use of general FPGA fabric (Hawkes, 2005).

Figure 4.

Two DSP48 arithmetic modules with a shared 48-bit C bus


4. Design of an efficient one-dimensional FFT

The discrete Fourier transform (DFT) of an N-point discrete-time complex sequence x(n), indexed by n=0, 1, N-1, is defined by

X ( k ) = n = 0 N 1 x ( n ) W N k n , k = 0,1, N 1 E8

where W N = e j 2 π / N and is referred to as the twiddle factor. The number of complex multiply and add operations for the computation of an N-point DFT is of order N2 but the problem is alleviated with the development of special fast algorithms, collectively known as fast Fourier transforms (Cooley & Tukey, 1965). These algorithms reduce the number of calculations to Nlog2N.

In the decimation-in frequency (DIF), the FFT algorithm starts with splitting the input data set X(k) into odd- and even-numbered points,

X ( k ) = n e v e n x ( n ) W N k n + n o d d x ( n ) W N k n = = m = 0 ( N / 2 ) 1 x ( 2 m ) W N 2 m k + m = 0 ( N / 2 ) 1 x ( 2 m + 1 ) W N k ( 2 m + 1 ) E9

So, the problem may be viewed as the DFT of N/2 point sequences, each of which may again be computed through two N/4 point DFTs and so on. This is illustrated in the form of a signal flow graph as an example for N=8 in figure 5.

Figure 5.

Computation of FFT by decimation in frequency for N=8

Finally, for the radix-2 FFT algorithm, the smallest transform or butterfly (basic computational unit) used is the 2-point DFT.

Generally, each butterfly implies one complex multiplier and two complex adders. In particular, multipliers consume much silicon area of FPGA because they are implemented with adder trees. Various implementation proposals have been made to save area removing these multipliers (Zhou et al. 2007, Chang & Jen, 1998, Guo, 2000 and Chien et al., 2005). Instead, DSP48 circuits allow some internal calculations of the Fourier transform algorithm or the filter of the phase reconstruction to be parallel, such as in complex multiplications. In this way, the use of these components accelerates the FFT calculation in comparison with a sequential algorithm or adder trees solutions. A complex pipeline multiplier is implemented using only four DSP48 (figure 6) to calculate:

P r e a l = A r e a l B r e a l A i m a g B i m a g P i m a g = A i m a g B r e a l + A r e a l B i m a g E10

Figure 6.

Pipeline, complex multiplier with four parallel real multipliers

The real and imaginary results use the same DSP48 slice configuration with the exception of the adder/subtracter. The adder/subtracter performs subtraction for the real result and addition for the imaginary one. This implementation only needs four clock cycles to calculate the complex multiplication with up to 550 MHz in XC4VSX35 Virtex-4 (Hawkes, 2005).

The complete pipeline radix-2 butterfly can be easily implemented with this specialized multiplier. It is necessary to use a FPGA Look-Up Table (LUT) (configured as SRL16 3-bits shift register) to preserve the synchronism. The butterfly implemented is depicted in Figure 7 and it needs only seven clock cycles to carry out the computation.

Figure 7.

Pipeline radix-2 butterfly in FPGA

A pipeline radix-2 FFT can be implemented using one butterfly in each stage. The twiddle coefficients used in each stage are stored in twiddle LUT ROMs in the FPGA. The logic resources and the clock cycles of the FFT module is reduced in our implementation using specific butterflies modules at the first and second stages. The first stage utilizes the feature of the twiddle factors related to the first stages of the pipeline.

W N N / 2 = 1 E11

So, the first stage can be implemented in a very simple way with an adder/subtracter. In the second stage, the next twiddle factors are

W N N / 4 = j E12

This twiddle suggests a similar splitting structure in the second pipeline stage as in the first one; however, the imaginary unit imposes a special consideration: two additional multiplexers change real and imaginary data, and the pipeline adder/subtracter works according equation 13.

( a + b j ) j = a j b = b + a j E13

Taking into account these features, the 1D-FFT architecture implementation is depicted in figure 8. The swap-blocks arrange the data flow (according figure 5) and preserve the pipeline feature. It consists of two multiplexers and two shift registers. These shift registers are implemented using look-up tables (LUT) in mode shift register (SRL16) for synchronization (figure 9).

Figure 8.

Architectural block diagram of a radix-2 pipeline FFT

Figure 9.

Swap unit

The system performs the calculation of the FFT with no scaling. The unscaled full-precision method was used to avoid error propagations. This option avoids overflow situations because output data have more bits than input data. Data precision at the output is:

o u t p u t w i d t h = i n p u t w i d t h + log 2 p o i n t s + 1 E14

The number of bits on the output of the multipliers is much larger than the input and must be reduced to a manageable width. The output can be truncated by simply selecting the MSBs required from the filter. However, truncation introduces an undesirable DC data shift. Due to the nature of two's complement numbers, negative numbers become more negative and positive numbers also become more negative. The DC shift can be improved with the use of symmetric rounding stages (figure 8). The periodic signals of the swap units, op signal, and the address generation for the twiddle memories are obtained through a counter module that acts like a control unit.

The pipeline 1D-FFT architecture design is completely parametrisable and portable. The VHDL module has three generics in order to obtain a standard module: fwd_inv, data_width and lognpoints (the logarithm in base 2 of the number of elements that fit with the number of stages in the 1-D FFT calculation). These generics then select direct or inverse transform, data value precision and the transform length.

4.1. Temporal analysis for the radix-2 pipeline FFT module and superior radix

In Table 1 is depicted the latency for each stage of an 8-points FFT (figure 8).

Stage Module Cycles
0 Swap 1 5
Adder 1 2
Swap 2 5
1 Multiplexers 1
Swap 3 3
Add/Sub 2 2
Swap 4 3
2 Multiplier 4
Round 1
Swap 5 2
Adder 3 2
Swap 6 2
Total 32

Table 1.

Latency of each module in a pipeline 8-points FFT

Taking into account the clock cycles of each block in table 1, the latency of the N-points FFT module (log2N stages) can be written as

l a t e n c y = 2 ( N 2 + 1 ) + 2 + 1 + 2 ( N 4 + 1 ) + 2 + + n = 2 ( log 2 N ) 1 [ 7 + 2 ( N 2 n + 1 + 1 ) ] , n = 2, 3, ( log 2 N ) 1 E15

where the two first stages are considered separately, and N and n are the number of points of the transform and the number of stages of the module respectively. Operating,

l a t e n c y = 9 + 3 N 2 + n = 2 ( log 2 N ) 1 [ 9 + N 2 n ] = = 9 + 3 N 2 + 9 ( log 2 N 2 ) + n = 2 ( log 2 N ) 1 N 2 n = = 3 N 2 + 9 log 2 N 9 + N n = 2 ( log 2 N ) 1 1 2 n = = 3 N 2 N 2 + 9 log 2 N 9 + N n = 1 ( log 2 N ) 1 1 2 n = = N + 9 log 2 N 9 + N n = 1 ( log 2 N ) 1 1 2 n E16

The last summand of this equation is a geometric series with common ratio equal to ½. This is a convergent series and the partial sum to r of the series is

S r = n = 1 r 1 2 n = 2 r 1 2 r E17

Where in this case r=log 2 N-1, so

S ( log 2 N ) 1 = 2 ( log 2 N ) 1 1 2 ( log 2 N ) 1 = 2 log 2 N 2 1 1 2 log 2 N 2 1 = N 2 1 N 2 = N 2 N E18

Adding the geometrical series in equation 17 and grouping, finally

l a t e n c y = 2 N + 9 log 2 N 11 , N = 8, 16, 32, E19

When the number of points of the FFT is a power of 4 is computationally more efficient to use a radix 4 algorithm instead of used radix 2. The reasoning is the same than radix 2 but subdividing iteratively a sequence of N data into four subsequences, and so on. The radix-4 FFT algorithm consists of log4N stages, each one containing N/4 butterflies. As the first weight is W N 0 = 1 , each butterfly involves three complex multiplications and 12 complex sums. Performing the sum in two steps, according to Proakis & Manolakis (1996), it is possible to reduce the number of sums (12 to 8). Therefore, the number of complex sums to perform is the same (Nlog 2 N) than the algorithm in base 2, but the multipliers are reduced by 25% (of (N/2)log 2 N to (3N/8)log 2 N). Consequently, the number of circuits for use DSP48 is reduced proportionately.

For number of point power of 4, the pipeline radix-4 FFT module has half of arithmetic stages, but the swap modules need twice clock cycles to arrange the data. Then, the latency is expressed as

l a t e n c y ( r a d i x 4 ) = 2 ( 3 4 N + 1 ) + 2 + + n = 1 log 4 N 1 [ 7 + 2 ( 3 4 n + 1 N + 1 ) ] = = 9 log 4 N 5 + 6 N n = 1 log 4 N 1 1 4 n E20

Again, the series can be estimated for log 4 N-1 terms, and finally

l a t e n c y ( r a d i x 4 ) = 2 N + 9 log 4 N 13 , N = 16, 64, 256, 1024, E21

This time estimation has been realized for other radix, as shown in the following equations:

l a t e n c y ( r a d i x 8 ) = 2 N + 9 log 8 N 21 , N = 64, 512, l a t e n c y ( r a d i x 16 ) = 2 N + 9 log 16 N 37 , N = 2 8 , 2 12 , E22


l a t e n c y ( r a d i x i ) = 2 N + 9 log i N 5 2 i , N = i n + 1 , n = 1,2,3, i = 4, 8, 16, E23

In Figure 10 is depicted the clock cycles of each algorithm and a proposed resolution with an orange line. All implementations are close to 2 when the number of points grows (figure 10a). The improvement in terms of computing speed of the algorithm using other radix is relevant when the number of samples is small. For example, the improvement factor for a 1024-point FFT is less than 7% using a radix-32 algorithm and less than 3% using a radix-4 (Figure 10b). However, in our astronomical case the proposed size is relativity small and the improvement using superior radix is relevant. Examining figure 10b, we can observe that the improvement factor is about 20% using radix-8 and 30% using radix-16. Thus, we are considering implementing these algorithms in the future.

4.2. Resources analysis

Several FFTs were implemented over a XC6VLX240T Virtex-6 device and numerical results were satisfactorily compared with Matlab simulations. Resources in this FPGA include 301440 flip-flops, 150720 6-LUTs, 416 BRAM and 768 DSP48s. The syntheses were achieved by changing the size of the FFT and the data precision.

Figure 11 shows the resource utilization to implement a pipeline 1024-FFT when the input data precision is between 8 and 24 bits. The resource utilization is greater when the FFT has more precision due to the increase of the intrinsic complexity found when the precision is increased. From 17 bits of precision, DSP48 circuits are increased and the maximum operating frequency is drastically smaller. It is due to high-precision complex multipliers need eight DSP48 circuits instead four in Figure 6. These high-precision multipliers are increased in each stage of the FFT from 17 bits data precision.

Figure 10.

a) Normalized latency. (b) Relative improvement respect of radix-2 algorithm

In figure 12 is depicted the resource utilization to implement one-dimension FFT when de number of points of the transform is increased in a power of 2.

Figure 11.

FPGA Resources and maximum operating frequency for 1024-FFT

Figure 12.

FPGA Resources and maximum operating frequency for N-FFT with N from 8 to 4096

4.3. Comparison with others implementations

A comparison has been carried out between our design and others implementations. The combined use of the FPGA technology and the developed architecture achieves an improved performance compared to other alternatives. This is showed in figure 13 where our implementation executes an 1024-point FFT operation in 10.64 μs at 200 MHz.


5. Wavefront phase recovery FPGA-implementation

We will focus on the FPGA implementation from equations (6) and (7) to improve processing time. These equations can be implemented using different architectures. We could choose a sequential architecture with a unique 2D-FFT module where data values use this module three times in order to calculate the phase. This architecture represents an example of an implementation using minimal resources of the FPGA. However, we are looking for a fast implementation of the equations in order to stay within the 6 ms limit of the atmospheric turbulence. Given these considerations we chose a parallel and completely pipeline architecture to implement the algorithm. Although the resources of the device increase considerably, we can maintain time restrictions by using extremely high-performance signal processing capability through parallelism. We therefore synthesize three 2D-FFTs instead of one 2D-FFT.

Figure 13.

FPGA Resources and maximum operating frequency for N-FFT with N from 8 to 4096

The block diagram of the designed recoverer is depicted in figure 14 where S x and S y represent the image displacement into each subpupil. The bidimensional transforms of S x and S y have to be multiplied by ip/p 2 +q 2 and iq/p 2 +q 2 respectively according to equation 6. These two matrices are identical if we change rows by columns. We can therefore store a unique ROM. The results of the adders (a pq coefficients) are rounded appropriately to obtain 16 bits data precision according with the data input width of the inversed bidimensional transform that is executed at the next stage.

An analysis of the equations and a parallel architecture of its implementation are taken into account. We then break down the design into the following steps or stages:

  1. Compute two real forward 2D FFT that compute FFT(S x ) and FFT(S y ).

  2. Compute the complex coefficients

  3. Carry out a complex inverse 2D FFT on a pq .

  4. Flip data results.

Figure 14.

Architecture of the synthesized phase recovery


5.1. 2D-FFT on the Virtex-6 FPGA

The fundamental operation in order to calculate the 2DFFT is equivalent to doing a 1D-FFT on the rows of the matrix and then doing a 1D-FFT on the columns of the result. Traditionally, the parallel and pipeline algorithm is then implemented in the following four steps.

  1. Compute the 1D-FFT for each row

  2. Transpose the matrix

  3. Compute the 1D-FFT for each row

  4. Transpose the matrix

Figure 15 depicts the diagram of the implemented transform module. The operation of the developed system takes place when image data is received in serial form by rows. These data are introduced in a block that carries out a one dimensional FFT. As this module obtains the transformed data, the information is stored in two double-port memories (real and imaginary data). To complete the bidimensional FFT, the stored data is introduced in a second 1D-FFT in column format. The 2D-FFT is then obtained from the output of this block.

Continuous data processing using a single dual-port memory (real and imaginary) is not possible. In that case, the new transformed data must wait for the old data to be introduced in the second FFT block. Otherwise data are overwritten. As a result, the pipeline property of the FFT architecture can not be used. This problem can be averted by using two memories instead of one, where memories are continuously commuting between write and read modes. When the odd memory is reading and introducing data values in the second FFT module, the even memory is writing data which arrives from the first FFT. So, data flow is continuous during all of the calculations in the bidimensional transform. The memory modes are always alternating and the function is selected by the counter. The same signal is used to commute the multiplexer that selects the data that enter the column transform unit.

Figure 15.

Block diagram of the implemented 2D-FFT

It is worth mentioning that the transposition step defined in the above (step 2) is implemented simultaneously with the transfer of column data vector to the memory with no delay penalty. In this way, the counter acts as an address generation unit. The last transposition step (step 4) is not implemented in order to save resources and obtain a fast global system. So, the last transposition step is taken into account only at last of the algorithm described in eq. (4)-(5) and showed in figure 13 (Flip-RAM module).

Row 1D-FFT block and column 1D-FFT block are not identical due to the unscaled data precision. For example, a 64x64 2D-FFT for the phase recoverer must meet certain requirements. If the precision of data input is 8 bits, the output data of 1D-FFT of the rows has to be 15 bits according equation 13. 1D-FFT of the columns accepts a 15 bits data format and 22 bits at the output.

Taking into account the latency of the FFT (equation 18) and the pipeline operation of the memory modules, the latency of the 2D-FFT module can be written as

l a t e n c y = N 2 + 4 N + 18 log 2 N 22 , N = 8, 16, 32, E24

In Figure 16 is depicted the latency for different sizes of 2D-FFT according to equation 18.

Table 2 shows a performance comparison of existing 2D-FFTs implementations using FPGA and other technologies for matrix sizes 64x64 and 128x128. Rodríguez-Ramos et al. 2007 implemented 2D-FFT on a CPU AMD XP 3500+, 2211 GHz, with 512 KB L2 cache and on a GPU nVidia GeForce 7800 GTX graphical engine with 256 MB RAM. Uzun et al. 2007 implemented several algorithms on a Virtex-2000E FPGA chip where the fastest design is depicted in the table. It can be seen that our design shows improvements when compared to [3] and [15] in terms of frame rate performance. Magdaleno et al. 2010 implemented 2D-FFT in a XC4VSX35 Virtex-4 operating at 100 MHz. Virtex-6 family can operate twice faster and the time operation is decreased in this way.

Figure 16.

Estimation of latency for several 2D-FFT

et al. [2007]
et al. [2007]
et al. [2007]
et al. [2010]
64x64 114.5 µs 1.58 ms - 44.4 µs 22.2 µs
128x128 811.0 µs 1.68 ms 2.38 ms 170.8 µs 85.0 µs
256x256 - - - - 333.4 µs

Table 2.

2D-FFT performance comparison with other designs

In figure 17 is depicted the resource utilization to implement 2D-FFT when de number of points of the transform is increased in a power of 2. The relevant resource is the Block-RAM. The 256x256 FFT occupies the 34% of BRAM for a XC6VLX240T Virtex-6 and the maximum operating frequency is only 297 MHz.

5.2. An 64x64 phase recoverer implementation

For a first prototype of the phase recoverer, we have selected a plenoptic sensor with 64x64 pixels sampling each microlens.

The first step is a direct use of the proposed 2D FFT. To accelerate the process, the two real transforms are executed simultaneously through a parallel implementation. Observe how the two 2D-FFT of phase gradients Sx and Sy are multiplied by some constant factors according to the formula of the phase recoverer (Figure 14). An adder is necessary in the following stage to calculate the frequency coefficients and achieve an inversed 2D-FFT. Phase values are then transposed, which require an intermediate memory to properly show output data (flip-ram module). The direct 2D-FFTs are real, so the imaginary components are zero. The inversed transform allows complex input data.

Figure 17.

Estimation of latency for several 2D-FFT

The importance of the parallel execution of the Sx and Sy transforms in the algorithm needs mentioning. This feature allows both inputs to be simultaneously received and to obtain the calculation of both data just as in the previous case at the output of the 2-D FFT units.

The bidimensional transforms of Sx and Sy have to be multiplied by ip/p 2 +q 2 and iq/p 2 +q 2 respectively according to Equation 6. These two 64x64 points matrix are identical if we change rows by columns. We can therefore store a unique ROM (two 64x16 bits ROM, one ROM for the real component of the factor and the other for the imaginary part). The addresses of these ROMs are supplied by the 2D-FFT module through cnt_fil and cnt_col signals. These signals are obtained from a built-in counter in this module. The ROM module has two generics (data_width and addr_width) in order to synthesize a standard single-port memory of any size. However, every time we change these generics, we have to use mathematical software (Matlab in this case) in order to calculate the elements and initiate ROM memory.

There are two complex multipliers. Each one performs the complex multiplication of the constant stored in the ROM by the 2-D FFT result (previously rounding to 18 bits). The complex multiplication needs four DSP48 circuits. Inside of this circuit, the complex multiplication uses multipliers, internal registers and adders/subtractors. These modules are completely standard using the a_width and b_width generics that select the precisions of the signals to multiply.

The sum of the outputs of the complex multipliers is implemented using slices exclusively. We implemented two adders, one for the real components and other for the imaginary ones. This module is also configured with a generic parameter that supplies data precision. The results of the adders (apq coefficients) are rounded appropriately to obtain 8 bits data precision according with the data input width of the inversed bidimensional transform that is executed at the next stage.

The FLIP-RAM module synthesizes a double dual-port memory similar to the memories that were described in the 2D-FFT section. While a memory is in read mode, the other one is in write mode. With this consideration, the total implemented system continues being pipeline. The addressing of the memories is obtained in a similar form, through a counter that is included in the inversed 2D-FFT. In this case, the memories only store real data (data values of the recovered phase). This module is necessary because the phase data that the inversed 2D-FFT provides are disorderly. The implemented module is depicted in figure 18.

Figure 18.

Block diagram of the output stage of the wavefront recoverer (Flip-RAM)

The design of a 64x64 phase recoverer was programmed using the VHDL hardware description language (IEEE, 2000) and XST (Xilinx, 2006b) was used to synthesize these modules into a XC6VLX240T Virtex-6 FPGA.

The complete system was successfully tested in circuit using ChipScope Pro software (using phase gradients obtained in simulations) that directly inserts a logic analyzer and bus analyzer into the design, allowing any internal signal to be viewed. Signals are captured at operating system speed and brought out through the programming interface. Captured signals can then be analyzed with the PC that acts as a logic analyzer. The numeric results were also successfully compared with those obtained in Matlab. Figure 19 shows an example of a wavefront reconstruction using a 64 × 64 subpupil recoverer. The two first images show the phase gradients (S x , S y ) given to the module. The last picture is the recovered phase using the implemented module.

Figure 19.

Phase gradients (S x and S y ) and the recovered phase for a Shack-Hartmann sensor with 64x64 subpupils

Table 3 shows the total time broken down into the stages of the total system (depicted in figure 13). 12980 clock cycles are necessary for phase recovery, starting with data reception to the activation of the ready signal. This value is the latency time for the phase recoverer. At a 200 MHz frequency clock, the system needs less than 65 μs to recover the phase.

Module Cycles Duration (@ 200 MHz)
(Sx and Sy)
4438 22.19 µs
Multipliers 4 0.02 µs
Adder 1 0.005 µs
IFFT2D 4438 22.19 µs
Flip-RAM 4096 20.48 µs
Rounding (3) 3 0.015 µs
Total 12980 64.90 µs

Table 3.

Execution time (latency) for the different stages of the 64x64 phase recoverer

Generalizing, the latency for a generic NxN phase recoverer is determined as follows:

l a t e n c y = 2 ( N 2 + 4 N + 18 log 2 N 22 ) + 4 + 1 + 3 + N 2 = = 3 N 2 + 8 N + 36 log 2 N 36 , N = 8, 16, 32, E25

Table 4 shows XC6VLX240T Virtex-6 resource utilization and the maximum operating frequency (pre-synthesis).

Slices FFs 6-LUT DSP48 BRAM Fmax [MHz]
5819 (1%) 5000 (3%) 104 (13%) 58 (13%) 402.188

Table 4.

XC6VLX240T Virtex-6 resource utilization to implement the phase recoverer prototype.


The implemented architecture is pipeline. This architecture allows phase data to be obtained for each 4,096 clock cycles (this number coincides with the number of points of the transforms, that is, the number of subpupils, 64 × 64, of the Shack-Hartmann sensor). Using the 200 MHz clock, the prototype provides new phase data each 20.48 μs.

These results can be compared with other works. Rodriguez-Ramos et al. (2007) implemented a 64 × 64 phase recoverer using GPU. In this technology, the wavefront reconstruction needs 3.531 ms. The FPGA implementation results almost 30 times faster. Baik et al. 2007 implemented a wave reconstruction with a 14 × 14 Shack-Hartmann array, an IBM PC and a Matrox Meteor-2 MC image processing board. The wavefront correction speed of the total system was 0.2 s. Although the system includes the gradient estimation, it can be seen that the execution times are slower than in the proposed implementation. Seifer et al. 2005 used a sensor with 16 × 12 subpupils and a Pentium M, 1.5 GHz. The wavefront reconstruction in this case was 50 ms using Zernike polynomials to adjust to the coefficients of the aperture function. Again, our implementation using FPGA technology is comparatively faster.


6. Conclusion

A wavefront phase can be recovered from a Shack-Hartmann sensor using FPGA as exclusive computational resource. Wavefront phase recovery in an FPGA is an even more satisfying computational technique because recovery times result faster than GPU or CPU implementations.

A 64 × 64 wavefront recoverer prototype was synthesized with a Xilinx XC6VLX240T Virtex-6 as sole computational resource. This FPGA is provided in a ML605 evaluation platform. Our prototype was designed using ISE Foundation 13.1. The system has been successfully validated in the FPGA chip using simulated data.

A two-dimensional FFT is implemented as nuclei algorithm of the recoverer: processing times are really short. The system can process data in much lower times than the atmospheric response. This feature allows more phases to be introduced in the adaptive optical process. Then, the viability of the FPGAs for AO in the ELTs is assured.

Future work is expected to be focused on the optimization of the 2D-FFT using others algorithms (radix-8, radix-16). Finally, next-generation Virtex-7 devices provide enough DSP48 resources in order to implement all the butterflies in the 64-FFT algorithm. Using these devices, phases in the adaptive optic process could be estimated in much lower times.


  1. 1. Analog Devices 2003 Analog Devices DSP Selection Guide 2003 Edition, Analog Devices, 2003, Available from
  2. 2. Amphion 2002 CS248 FFT/IFFT Core Datasheet, Amphion Ltd., 2002. Available from www. -Amphion-datasheet.html
  3. 3. Altera 2002 FFT Megacore Function User Guide, 2002, Available from
  4. 4. Bass B. 1999 A low-power, high-performance 1024-point FFT processor, IEEE J. Solid-State Circuits, 34 380 387 , 0018-9200
  5. 5. Baik S. H. Park S. K. Kim C. J. Cha B. 2007 A center detection algorithm for Shack-Hartmann wavefront sensor, Optics & Laser Technology, 39, 262-267, 2007, 0030-3992
  6. 6. Chang T. S. Jen C. W. . 1998 Hardware efficient transform designs with cyclic formulation and subexpression sharing, Proceedings of th 1998 IEEE ISCAS 1998 2 398 401 , 0-78034-455-3 California, USA, May 31-Jun 3, 1998.
  7. 7. Chien C. D. Lin C. C. Yang C. H. Guo J. I. 2005 Design and realization of a new hardware efficient IP core for the 1-D discrete Fourier transform IEE Proc. Circuits, Devices and Systems, 152 247 258 , 1350-2409
  8. 8. Cooley J. W. Tukey J. W. 1965 An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, 1965 19 297 301 , 0025-5718
  9. 9. Craven S. Athanas P. . 2007 Examinig the Viability of FPGA Supercomputing, EURASIP Journal on Embedded Systems, 2007, 8, 1687-3963
  10. 10. Deschamps J. Bioul G. Sutter G. 2006 Synthesis of Arithmetic Circuits. FPGA, ASIC and Embedded Systems, Wiley-Interscience, 2006, 0-47168-783-9
  11. 11. Frigo M. Johnson S. 1998 FFTW: An adaptive software of the FFT, Proceedings of IEEE International Conference On Acoustics, Speech, and Signal Processing 1998 1381 1384 , 0-78034-428-6 Washington, USA, May 12-15.
  12. 12. Guo J. I. 2000 An efficient parallel adder based design for one dimensional discrete Fourier transform, Proc. Natl. Sci. Counc. ROC, 2000 24 195 204 .
  13. 13. Hawkes G. C. 2005 DSP: Designing for Optimal Results. High-Performance DSP Using Virtex-4 FPGAs, Xilinx, Available from
  14. 14. IEEE 2000 Standard VHDL Language Reference Manual, IEEE-1076 2000 , IEEE, 2000, 0-73813-326-4
  15. 15. Magdaleno E. Rodríguez M. Rodríguez-Ramos J. M. 2010 An efficient pipeline wavefront phase recovery for the CAFADIS camera for extremely large telescopes, Sensors, 1424-8220
  16. 16. Mentor Graphics 2002 FFT/WinFFT/Convolver Transforms Core Datasheet, 2002, Available from
  17. 17. Meyer-Baese U. 2001 Digital Signal Processing with Field Programmable Gate Arrays, Springer-Berlag, 2001, 3-54072-612-8
  18. 18. Motorola 2002 Motorota DSP 56600 16 -bit DSP Family Datasheet, Motorola, 2002, Available from
  19. 19. Rodríguez-Ramos J. M. Marichal-Hernández J. G. Rosa F. 2007 Modal Fourier wavefront reconstruction using graphics processing units, Journal of Electronic Imaging, 16 123 134 , 1017-9909
  20. 20. Poyneer L. A. Gave D. T. Brase J. M. 2002 Fast wave-front reconstruction in large adaptive optics systems with use of the Fourier transforms, Journal of the Optical Society of America A, 2002 19 2100 2111 , 1084-7529
  21. 21. Roddier F. Roddier C. 1991 Wavefront reconstruction using iterative Fourier transforms, Applied Optics, 30 1325 1327 , 2155-3165
  22. 22. Proakis J. G. Manolakis D. K. 1996 Digital Signal Proccesing. Principles, Algorithms and Applications, Prentice Hall, 1996, 0-13187-374-1
  23. 23. Seifert L. Tiziani H. J. Osten W. 2005 Wavefront reconstruction with the adaptive Shack-Hartmann sensor, Optics Communications, 245, 255-269, 2005, North Holland, 0030-4018
  24. 24. Sukhsawas S. Benkrid K. 2004 A high-level implementation of a high performance pipeline FFT on Virtex-E FPGAs, Proceedings of the IEEE Computer Annual Symposium on VLSI Emerging Trends in VLSI Systems Design 2004, 0-76952-097-9 Louisiana, USA, February 19 20 .
  25. 25. Texas Instruments 2003 Texas Instruments C62x and C67x DSP Benchmarks, Texas Instruments, 2003, Available from
  26. 26. Uzun I. S. Amira A. Bouridane A. 2005 FPGA implementations of fast Fourier transforms for real-time signal and image processing, IEE Proceedings- Vision Image and Signal Processing, 152 283 296 , 0135-0245X.
  27. 27. Vite J. A. Romero R. Ordaz A. 2005 VHDL Core for 1024 -Point Radix-4 FFT Computation, Proceedings of the 2005 International Conference on Reconfigurable Computing and FPGAs, Puebla, Mexico, September 28-30, 0-76952-456-7
  28. 28. Wosnitza M. 1999 High precision 1024 -point FFT processor for 2-D object detection, PhD thesis, Harung-Gorre Verlag, Germany, 1999.
  29. 29. Xilinx 2006a Fast Fourier Transform 3 2, 2006, Available from
  30. 30. Xilinx 2006b XST User Guide, Xilinx, 2006, 118 217 , Available from
  31. 31. Zhou Y. Noras J. M. Shepherd S. J. 2007 Novel design of multiplier-less FFT processors, Signal Processing, 87 1402 1407 , 0165-0684

Written By

Eduardo Magdaleno and Manuel Rodríguez

Submitted: March 23rd, 2011 Published: January 20th, 2012