FSMD-Based Hardware Accelerators for FPGAs

Current VLSI technology allows the design of sophisticated digital systems with escalated demands in performance and power/energy consumption. The annual increase of chip complexity is 58%, while the increase in human designers' productivity is limited to 21% per annum (ITRS, 2011). The growing technology-productivity gap is probably the most important problem in the industrial development of innovative products. A dramatic increase in designer productivity is only possible through the adoption of methodologies/tools that raise the design abstraction level, ingeniously hiding low-level, time-consuming, error-prone details. New EDA methodologies aim to generate digital designs from high-level descriptions, a process called High-Level Synthesis (HLS) (Coussy & Morawiec, 2008) or else hardware compilation (Wirth, 1998). The input to this process is an algorithmic description (for example in C/C++/SystemC), from which synthesizable and verifiable Verilog/VHDL designs are generated (IEEE, 2006; 2009).


Introduction
Our aim is to highlight aspects regarding the organization and design of the hardware targeted by such a process. In this chapter, it is argued that a proper Model of Computation (MoC) for the targeted hardware is an adapted and extended form of the FSMD (Finite-State Machine with Datapath) model, which is universal, well-defined and suitable for either data- or control-dominated applications. Several design examples will be presented throughout the chapter that illustrate our approach.

Higher-level representations of FSMDs
This section discusses issues related to higher-level representations of FSMDs (Gajski & Ramachandran, 1994), focusing on textual intermediate representations (IRs). It first provides a short overview of existing approaches, focusing on the well-known GCC GIMPLE and LLVM IRs. Then the BASIL (Bit-Accurate Symbolic Intermediate Language) is introduced as a more appropriate lightweight IR for self-contained representation of FSMD-based hardware architectures. Lower-level graph-based forms are presented, focusing on the CDFG (Control-Data Flow Graph) procedure-level representation using Graphviz (Graphviz, 2011) files. This section also illustrates a linear CDFG construction algorithm from BASIL. In addition, an end-to-end example is given illustrating algorithmic specifications in ANSI

Representing programs in BASIL
BASIL provides arbitrary n-to-m mappings allowing the elimination of implicit side-effects, a single construct for all operations, and bit-accurate data types. It supports scalar, single-dimensional array and streamed I/O procedure arguments. BASIL statements are labels, n-address instructions or procedure calls.
BASIL is similar in concept to the GIMPLE and LLVM intermediate languages but with certain unique features. For example, while BASIL supports SSA form, it provides very light operation semantics. A single construct is required for supporting any given operation as an n-to-m mapping between source and destination sites. An n-address operation is actually the specification of a mapping from a set of n ordered inputs to a set of m ordered outputs. An n-address instruction (also termed an (n, m)-operation) is formatted as follows:
outp1, ..., outpm <= operation inp1, ..., inpn;
where:
• operation is a mnemonic referring to an IR-level instruction
• outp1, ..., outpm are the m outputs of the operation
• inp1, ..., inpn are the n inputs of the operation
In BASIL all declared objects (global variables, local variables, input and output procedure arguments) have an explicit static type specification. BASIL uses the notions of "globalvar" (a global scalar or single-dimensional array variable), "localvar" (a local scalar or single-dimensional array variable), "in" (an input argument to the given procedure), and "out" (an output argument to the given procedure).
BASIL supports bit-accurate data types for integer, fixed-point and floating-point arithmetic. Data type specifications are essentially strings that can be easily decoded by a regular expression scanner; examples are given in Table 1.
The EBNF grammar for BASIL is shown in Fig. 1, where it can be seen that rules "nac" and "pcall" provide the means for the n-to-m generic mapping for operations and procedure calls, respectively. It is important to note that BASIL has no predefined operator set; operators are defined through a textual mnemonic.
For instance, an addition of two scalar operands is written: a <= add b, c;. Control-transfer operations include conditional and unconditional jumps explicitly visible in the IR. An example of an unconditional jump would be: BB5 <= jmpun; while conditional jumps always declare both targets: BB1, BB2 <= jmpeq i, 10;. This statement enables a control transfer to the entry of basic block BB1 when i equals 10, otherwise to BB2. Multi-way branches corresponding to compound decoding clauses can be easily added.

Fig. 1. EBNF grammar for BASIL.

An interesting aspect of BASIL is the support of procedures as non-atomic operations by using a similar form to operations. In (y) <= sqrt(x); the square root of an operand x is computed; procedure argument lists are indicated as enclosed in parentheses.
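The statement forms quoted above can be decoded with a few lines of Python. This is only a minimal sketch under stated assumptions: it covers just the operation and procedure-call forms shown in the text, not the full grammar of Fig. 1, and the names NAddressInstr and parse_statement are hypothetical helpers, not part of any BASIL tool.

```python
import re
from typing import List, NamedTuple


class NAddressInstr(NamedTuple):
    outputs: List[str]   # the m ordered destinations
    operation: str       # textual mnemonic (BASIL has no predefined operator set)
    inputs: List[str]    # the n ordered sources
    is_call: bool        # True for the parenthesized procedure-call form


def _split(operands: str) -> List[str]:
    """Split a comma-separated operand list, dropping empty items."""
    return [tok.strip() for tok in operands.split(",") if tok.strip()]


# Procedure call, e.g. "(y) <= sqrt(x);"
_CALL = re.compile(r"^\((.*)\)\s*<=\s*(\w+)\s*\((.*)\)\s*;$")
# Operation, e.g. "a <= add b, c;" or "BB5 <= jmpun;"
_OP = re.compile(r"^(.*?)\s*<=\s*(\w+)\s*(.*?)\s*;$")


def parse_statement(text: str) -> NAddressInstr:
    """Decode one statement into its (outputs, mnemonic, inputs) parts."""
    text = text.strip()
    m = _CALL.match(text)
    if m:
        return NAddressInstr(_split(m.group(1)), m.group(2), _split(m.group(3)), True)
    m = _OP.match(text)
    if m is None:
        raise ValueError(f"not an n-address instruction: {text!r}")
    return NAddressInstr(_split(m.group(1)), m.group(2), _split(m.group(3)), False)


if __name__ == "__main__":
    for line in ("a <= add b, c;", "BB1, BB2 <= jmpeq i, 10;", "(y) <= sqrt(x);"):
        print(parse_statement(line))
```

Because the mnemonic is just a word, new operators need no parser changes, which mirrors the remark that BASIL has no predefined operator set.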

BASIL program structure and encoding
A specification written in BASIL incorporates the complete information of a translation unit of the original program, comprising a list of "globalvar" definitions and a list of procedures (equivalently: control-flow graphs). A single BASIL procedure is captured by the following information:
• procedure name
• ordered input (output) arguments
• "localvar" definitions
• BASIL statements.
Label items point to basic block (BB) entry points and are defined as (name, bb, addr) 3-tuples, where name is the corresponding identifier, bb the basic block enumeration, and addr the absolute address of the statement succeeding the label.
Statements are organized in the form of a C struct or equivalently a record (in other programming languages) as shown in Fig. 2.
The Statement ADT therefore can be used to model an (n, m)-operation. The input and output operand lists collect operand items, as defined in the OperandItem data structure definition shown in Fig. 3.
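Since Figs. 2 and 3 are not reproduced here, the following Python sketch only approximates the records they describe. The field names, the kind strings and the "s32" type string are assumptions made for illustration; only the Statement and OperandItem names and the (name, bb, addr) label tuple come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class OperandItem:
    name: str               # identifier or literal text
    data_type: str          # bit-accurate type string (format assumed here)
    kind: str = "localvar"  # "globalvar", "localvar", "in", "out" or "constant"


@dataclass
class Statement:
    mnemonic: str                                             # textual operation name
    inputs: List[OperandItem] = field(default_factory=list)   # n ordered inputs
    outputs: List[OperandItem] = field(default_factory=list)  # m ordered outputs
    bb: int = 0                                               # enclosing basic block

    @property
    def arity(self) -> Tuple[int, int]:
        """Return (n, m): the number of inputs and outputs."""
        return len(self.inputs), len(self.outputs)


@dataclass
class Label:
    name: str   # identifier of the label
    bb: int     # basic block enumeration
    addr: int   # absolute address of the statement succeeding the label


if __name__ == "__main__":
    add = Statement(
        "add",
        inputs=[OperandItem("b", "s32"), OperandItem("c", "s32")],
        outputs=[OperandItem("a", "s32")],
    )
    print(add.arity)
```

Keeping inputs and outputs as two ordered lists is what lets one record type serve every (n, m)-operation, including jumps whose outputs are basic block labels.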

A basic BASIL implementation
A basic operation set for RISC-like compilation is summarized in Table 2. N_i (N_o) denotes the number of input (output) operands for each operation.
The memory access model defines dedicated address spaces per array, so that both loads and stores require the array identifier as an explicit operand. For an indexed load in C (b = a[i];), a frontend would generate the following BASIL: b <= load a, i;, while for an indexed store (a[i] = b;) it is a <= store b, i;.
Pointer accesses can be handled in a similar way, although dependence extraction requires careful data flow analysis for non-trivial cases. Multi-dimensional arrays are handled through matrix flattening transformations.
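The memory model and the flattening transformation can be pictured with a small Python sketch. It assumes row-major (C-style) ordering for flattening, which the text does not state explicitly; ArrayMemory and flatten_index are hypothetical names used only for this illustration.

```python
from typing import Dict, List, Sequence


class ArrayMemory:
    """One dedicated address space per named array."""

    def __init__(self) -> None:
        self.spaces: Dict[str, List[int]] = {}

    def declare(self, name: str, size: int) -> None:
        self.spaces[name] = [0] * size

    def _check(self, array: str, index: int) -> None:
        if not 0 <= index < len(self.spaces[array]):
            raise IndexError(f"{array}[{index}] is out of range")

    def load(self, array: str, index: int) -> int:
        """Models: b <= load a, i;"""
        self._check(array, index)
        return self.spaces[array][index]

    def store(self, array: str, value: int, index: int) -> None:
        """Models: a <= store b, i;"""
        self._check(array, index)
        self.spaces[array][index] = value


def flatten_index(indices: Sequence[int], dims: Sequence[int]) -> int:
    """Row-major offset of an N-dimensional index (ordering is an assumption)."""
    offset = 0
    for idx, dim in zip(indices, dims):
        if not 0 <= idx < dim:
            raise IndexError("index out of range")
        offset = offset * dim + idx
    return offset


if __name__ == "__main__":
    mem = ArrayMemory()
    mem.declare("a", 4 * 5)                    # a[4][5] flattened to 20 words
    i = flatten_index((2, 3), (4, 5))          # a[2][3]
    mem.store("a", 99, i)
    print(i, mem.load("a", i))
```

Because every access names its array, two accesses to different arrays can never alias, which is what keeps dependence extraction simple for the non-pointer case.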

CDFG construction
A novel, fast CDFG construction algorithm has been devised for both SSA and non-SSA BASIL forms, producing flat CDFGs as Graphviz files (Fig. 5). A CDFG symbol table item is a node (operation, procedure call, globalvar, or constant) or edge (localvar) with user-defined attributes: the unique name, label and data type specification; node and edge type enumeration; respective order of incoming or outgoing edges; input/output argument order of a node and basic block index. Further attributes can be defined, e.g. for scheduling bookkeeping.
This approach is unique since it focuses on building the CDFG symbol table (st), from which the associated graph (cdfg) is constructed as one of many possible facets. It naturally supports loop-carried dependencies and array accesses.
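The symbol-table-first idea can be illustrated with a deliberately reduced Python sketch. It is not the algorithm of Fig. 5: it handles a single basic block, records only data-dependence edges, and omits control dependences and loop-carried dependencies. The tuple layout of statements and the node naming scheme (n0, src_a) are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Stmt = Tuple[List[str], str, List[str]]   # (outputs, mnemonic, inputs)


def build_symbol_table(stmts: List[Stmt]):
    """Collect nodes and variable-labelled edges for straight-line code."""
    nodes: Dict[str, str] = {}               # node id -> label
    edges: List[Tuple[str, str, str]] = []   # (source node, target node, variable)
    last_def: Dict[str, str] = {}            # variable -> node that defines it
    for idx, (outs, op, ins) in enumerate(stmts):
        node = f"n{idx}"
        nodes[node] = op
        for var in ins:
            if var not in last_def:
                # Not produced by an earlier operation: model it as a source
                # node (input argument, global variable or constant).
                src = f"src_{var}"
                nodes[src] = var
                last_def[var] = src
            edges.append((last_def[var], node, var))
        for var in outs:
            last_def[var] = node
    return nodes, edges


def to_dot(nodes, edges, name: str = "cdfg") -> str:
    """Render the symbol table as Graphviz DOT text (one possible facet)."""
    lines = [f"digraph {name} {{"]
    for node, label in nodes.items():
        lines.append(f'  {node} [label="{label}"];')
    for src, dst, var in edges:
        lines.append(f'  {src} -> {dst} [label="{var}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    program = [(["t"], "add", ["a", "b"]), (["u"], "shr", ["t", "1"])]
    print(to_dot(*build_symbol_table(program)))
```

The DOT text is generated only at the end, from the table; other facets (for instance a schedule listing) could be produced from the same nodes and edges.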

Fixed-point arithmetic
The use of fixed-point arithmetic (Yates, 2009) provides an inexpensive means for improved numerical dynamic range, when artifacts due to quantization and overflow effects can be tolerated. Rounding operators are used for controlling the numerical precision involved in a series of computations; they are defined for inexact arithmetic representations such as fixed- and floating-point. Proposed and in-use specifications for fixed-point arithmetic of related practice include:
• the C99 standard (ISO/IEC JTC1/SC22, 2007)
• lightweight custom implementations such as (Edwards, 2006)
• explicit data types with open source implementations (Mentor Graphics, 2011; SystemC, 2006)
Fixed-point arithmetic is a variant of the typical integral representation (2's-complement signed or unsigned) where a binary point is defined, purely as a notational artifact to signify integer powers of 2 with a negative exponent. Assuming an integer part of width IW > 0 and a fractional part with −FW < 0, the VHDL-2008 sfixed data type has a range of $2^{IW-1} - 2^{-|FW|}$ to $-2^{IW-1}$ with a representable quantum of $2^{-|FW|}$ (Bishop, 2010a;b). The corresponding ufixed type has the following range: $2^{IW} - 2^{-|FW|}$ to 0. Both are defined properly given an IW-1:-FW vector range.
BASIL currently supports a proposed list of extension operators for handling fixed-point arithmetic:
• conversion from integer to fixed-point format: i2ufx, i2sfx
• conversion from fixed-point to integer format: ufx2i, sfx2i
• operand resizing: resize, using three input operands; source operand src1, and src2, src3 as numerical values that denote the new size (high-to-low range) of the resulting fixed-point operand
• rounding primitives: ceil, fix, floor, round, nearest, convergent for rounding towards plus infinity, zero, minus infinity, and nearest (ties to greatest absolute value, plus infinity and closest even, respectively).
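The ranges and rounding primitives above can be checked numerically with exact rational arithmetic. The sketch below is an illustration only: the function names are hypothetical, FW is taken as a positive count of fractional bits, and the tie-breaking rules are implemented as this text lists them rather than as any particular package defines them.

```python
import math
from fractions import Fraction


def sfixed_range(iw: int, fw: int):
    """(min, max, quantum) of a signed format with iw integer and fw fractional bits."""
    quantum = Fraction(1, 2 ** fw)
    return -Fraction(2 ** (iw - 1)), Fraction(2 ** (iw - 1)) - quantum, quantum


def ufixed_range(iw: int, fw: int):
    """(min, max, quantum) of the unsigned counterpart."""
    quantum = Fraction(1, 2 ** fw)
    return Fraction(0), Fraction(2 ** iw) - quantum, quantum


def quantize(value, fw: int, mode: str) -> int:
    """Round value to a multiple of 2**-fw and return the number of quanta."""
    scaled = Fraction(value) * 2 ** fw
    if mode == "ceil":          # towards plus infinity
        return math.ceil(scaled)
    if mode == "fix":           # towards zero
        return math.trunc(scaled)
    if mode == "floor":         # towards minus infinity
        return math.floor(scaled)
    if mode == "round":         # nearest, ties to greatest absolute value
        magnitude = math.floor(abs(scaled) + Fraction(1, 2))
        return magnitude if scaled >= 0 else -magnitude
    if mode == "nearest":       # nearest, ties towards plus infinity
        return math.floor(scaled + Fraction(1, 2))
    if mode == "convergent":    # nearest, ties to the closest even value
        return round(scaled)    # Fraction rounds half to even
    raise ValueError(f"unknown rounding mode: {mode}")


if __name__ == "__main__":
    print(sfixed_range(2, 14))  # a Q2.14-style signed format
    for mode in ("ceil", "fix", "floor", "round", "nearest", "convergent"):
        print(mode, quantize(Fraction(-5, 2), 0, mode))
```

The six modes differ only on values that fall between two quanta, and the last three differ only on exact ties, which is why a tie such as ±2.5 quanta is the useful test input.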

Scan-based SSA construction algorithms for BASIL
In our experiments with BASIL we have investigated minimal SSA construction schemes, namely the Appel (Appel, 1998) and Aycock-Horspool (Aycock & Horspool, 2000) algorithms, that do not require the computation of the iterated dominance frontier (Cytron et al., 1991). In traditional compilation infrastructures (GCC, LLVM) (GCC, 2011; LLVM, 2011), Cytron's approach (Cytron et al., 1991) is preferred since it enables bit-vector dataflow frameworks and optimizations that require elaborate data structures and manipulations. It can be argued that rapid prototyping compilers, integral parts of heterogeneous design flows, would benefit from straightforward SSA construction schemes which do not require the use of sophisticated concepts and data structures (Appel, 1998; Aycock & Horspool, 2000).
The general scheme for these methods consists of a series of passes for variable numbering, φ-insertion, φ-minimization, and dead code elimination. The lists of BASIL statements, localvars and labels are all affected by the transformations.
The first algorithm presents a "really-crude" approach for variable renaming and φ-function insertion in two separate phases (Appel, 1998). In the first phase, every variable is split at BB boundaries, while in the second phase φ-functions are placed for each variable in each BB.
Variable versions are actually preassigned in constant time and reflect a specific BB ordering (e.g. DFS). Thus, variable versioning starts from a positive integer n, equal to the number of BBs in the given CFG.
The second algorithm does not predetermine variable versions at control-flow joins but accounts for φs in the same way as actual computations visible in the original CFG. Due to this fact, φ-insertion also presents dissimilarities. Both methods share common φ-minimization and dead code elimination phases.
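The shared φ-minimization phase can be pictured with the following Python sketch. It uses the commonly cited simplification rule (a φ whose arguments, ignoring self-references, name at most one other version is redundant); this is an assumption about the phase, not a transcription of the chapter's implementation, and the dictionary encoding of φ-functions is hypothetical.

```python
from typing import Dict, List, Tuple


def minimize_phis(phis: Dict[str, List[str]]) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Remove redundant phi-functions.

    phis maps each phi target version to its list of argument versions.
    Returns the surviving phis and a renaming map for the removed targets.
    """
    phis = {target: list(args) for target, args in phis.items()}
    replaced: Dict[str, str] = {}
    changed = True
    while changed:                       # iterate until no phi can be removed
        changed = False
        for target in list(phis):
            others = {arg for arg in phis[target] if arg != target}
            if len(others) > 1:
                continue                 # a genuine merge of distinct versions
            del phis[target]
            changed = True
            if others:
                replacement = others.pop()
                replaced[target] = replacement
                # Rewrite uses of the removed target in the remaining phis.
                for other in phis:
                    phis[other] = [replacement if a == target else a for a in phis[other]]

    def resolve(version: str) -> str:
        while version in replaced:       # follow chains of replacements
            version = replaced[version]
        return version

    return phis, {old: resolve(new) for old, new in replaced.items()}


if __name__ == "__main__":
    # x is never redefined inside the loop, so both phis are redundant;
    # y is merged from two distinct definitions, so its phi must stay.
    crude = {"x1": ["x0", "x2"], "x2": ["x1", "x1"], "y3": ["y1", "y2"]}
    print(minimize_phis(crude))
```

In a complete pass the returned renaming map would also be applied to ordinary statements, and dead code elimination would then remove definitions that are no longer used.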

Application profiling with BASILVM
BASIL programs can be translated to low-level C for the easy evaluation of nominal performance on an abstract machine, called BASILVM. To show the applicability of BASILVM profiling, a set of small realistic integer/fixed-point kernels has been selected: atsort (an all topological sorts algorithm (Knuth, 2011)), coins (compute change with minimum amount of coins), easter (Easter date calculations), fixsqrt (fixed-point square root (Turkowski, 1995)), perfect (perfect number detection), sieve (prime sieve of Eratosthenes) and xorshift (100 calls to George Marsaglia's PRNG (Marsaglia, 2003) with a $2^{128} - 1$ period, which passes Diehard tests). [...] procedures), vertices and edges (for each procedure) in columns 4-5, amount of φ statements (column 6) and the number of dynamic instructions for the non-SSA case. The latter is measured using gcc-3.4.4 on Cygwin/XP by means of the executed code lines with the gcov code coverage tool.

Representative example: 2D Euclidean distance approximation
A fast linear algorithm for approximating the Euclidean distance of a point (a, b) from the origin is given in (Gajski et al., 2009) by the equation: eda = MAX((0.875 * x + 0.5 * y), x) where x = MAX(|a|, |b|) and y = MIN(|a|, |b|). The average error of this approximation against the integer-rounded exact value ($dist = \sqrt{a^2 + b^2}$) is 4.7% when compared to the rounded-down ⌊dist⌋ and 3.85% to the rounded-up ⌈dist⌉ value. Fig. 6 shows the three relevant facets of eda: ANSI C code (Fig. 6(a)), a manually derived BASIL implementation (Fig. 6(b)) and the corresponding CDFG (Fig. 6(c)). Constant multiplications have been reduced to adds, subtracts and shifts. The latter subfigure naturally also shows the ASAP schedule of the data flow graph, which is evidently of length 7.
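As a hedged illustration of the strength-reduced form, the equation can be written with shifts and subtracts in Python. The particular decomposition used here (0.875 * x as x minus x/8, 0.5 * y as a right shift) is an assumption; the chapter's Fig. 6 code is not reproduced and may order the operations differently.

```python
def eda(a: int, b: int) -> int:
    """Approximate the Euclidean distance of the integer point (a, b) from the origin."""
    x = max(abs(a), abs(b))
    y = min(abs(a), abs(b))
    # 0.875 * x == x - x/8 and 0.5 * y == y/2, both realized with shifts.
    t = (x - (x >> 3)) + (y >> 1)
    return max(t, x)


if __name__ == "__main__":
    for point in ((3, 4), (-6, 8), (8, 0)):
        print(point, eda(*point))
```

Integer shifts truncate, so results can differ by one from a floating-point evaluation of the same equation followed by rounding.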

Architecture and organization of extended FSMDs
This section deals with aspects of the specification and design of FSMDs, especially their interface, architecture and organization, as well as communication and integration issues. The section is wrapped up with realistic examples of CDFG mappings to FSMDs, alongside their performance investigation with the help of HDL simulations.

FSMD overview
A Finite-State Machine with Datapath (FSMD) specification (Gajski & Ramachandran, 1994) is an upgraded version of the well-known Finite State Machine representation, providing the same information as the equivalent CDFG (Gajski et al., 2009). The main difference is the introduction of embedded actions within the next state generation logic. An FSMD specification is timing-aware since it must be decided that each state is executed within a certain number of machine cycles. Also the precise RTL semantics of operations taking place within these cycles must be determined. In this way, an FSMD can provide an accurate model of an RTL design's performance as well as serve as a synthesizable manifestation of the designer's intent. Depending on the RT-level specification (usually VHDL or Verilog) it can convey sufficient details for hardware synthesis to a specific target platform, e.g. Xilinx FPGA devices (Xilinx, 2011b).

Extended FSMDs
The FSMDs of our approach follow the established scheme of a Mealy FSM with computational actions embedded within state logic (Chu, 2006). In this work, the extended FSMD MoC describing the hardware architectures supports the following features, the most relevant of which will be described and supported by short examples:
• Support of scalar and array input and output ports.
• Support of streaming inputs and outputs, allowing mixed types of input and output ports in the same design block.
• Communication with embedded block and distributed LUT memories.
• Design of a latency-insensitive local interface of the FSMD units to master FSMDs, assuming the FSMD is a locally-interfaced slave.
• Design of memory interconnects for the FSMD units.
Advanced issues in the design of FSMDs that are not covered include the following:
• Mapping of SSA-form (Cytron et al., 1991) low-level IR (BASIL) directly to hardware, by the hardware implementation of variable-argument φ functions.
• Communication to global aggregate type storage (global arrays) from within the context of both root and non-root procedures using a multiplexer-based bus controlled by a scalable arbiter.

Interface
The FSMDs of our approach use fully-synchronous conventions and register all their outputs (Chu, 2006; Keating & Bricaud, 2002). The control interface is rather simple, yet can service all possible designs:
• clk: signal from external clocking source
• reset (rst or arst): synchronous or asynchronous reset, depending on target specification
• ready: the block is ready to accept new input
• valid: asserted when a certain data output port is streamed-out from the block (generally it is a vector)
• done: end of computation for the block
The ready signal signifies only the ability to accept new input (non-streamed) and does not address the status of an output (streaming or not).
Multi-dimensional data ports are feasible based on their equivalent single-dimensional flattened array type definition. Then, port selection is a matter of bitfield extraction. For instance, data input din is defined as din: in std_logic_vector(M * N-1 downto 0);, where M, N are generics. The flattened vector defines M input ports of width N. A selection of the form din((i+1) * N-1 downto i * N) is typical for a for-generate loop in order to synthesize iterative structures.
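The slice din((i+1) * N-1 downto i * N) corresponds to a shift-and-mask on an integer view of the flattened vector. The Python sketch below models that selection in software only; select_port and pack_ports are hypothetical helper names, and port 0 is assumed to occupy the least significant bits, as the slice bounds imply.

```python
from typing import Iterable


def select_port(din: int, i: int, n: int) -> int:
    """Return port i, i.e. bits (i+1)*n-1 downto i*n of the flattened vector."""
    return (din >> (i * n)) & ((1 << n) - 1)


def pack_ports(values: Iterable[int], n: int) -> int:
    """Build the flattened vector from per-port values (port 0 in the low bits)."""
    din = 0
    for i, value in enumerate(values):
        din |= (value & ((1 << n) - 1)) << (i * n)
    return din


if __name__ == "__main__":
    din = pack_ports([0xA, 0xB, 0xC], 4)       # M = 3 ports of width N = 4
    print(hex(din), [hex(select_port(din, i, 4)) for i in range(3)])
```

In hardware no shifter is synthesized for this: with constant i and N the selection is pure wiring, which is why it suits a for-generate loop.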
The following example (Fig. 8) illustrates an element-wise copy of array b to c without the use of a local array resource. Each interface array consists of 10 elements. It should be assumed that the physical content of both arrays lies in distributed LUT RAM, from which custom connections can be implemented. Fig. 8(a) illustrates the corresponding function func1. The VHDL interface of func1 is shown in Fig. 8(b), where the derived array types b_type and c_type are used for b, c, respectively. The definitions of these types can be easily devised as aliases to a basic type denoted as: type cdt_type is array (9 downto 0) of std_logic_vector(31 downto 0);. Then, the alias for b is: alias b_type is cdt_type;

Architecture and organization
The FSMDs are organized as computations allocated into n + 2 states, where n is the number of required control steps as derived by an operation scheduler. The two overhead states are the entry (S_ENTRY) and the exit (S_EXIT) states, which correspond to the source and sink nodes of the control-data flow graph of the given procedure, respectively.
Fig. 9 shows the absolute minimal example of a compliant FSMD written in VHDL. The FSMD is described in a two-process style using one process for the current state logic and another process for a combined description of the next state and output logic. This code will serve as a running example for better explaining the basic concepts of the FSMD paradigm.

Fig. 8. Array-to-array copy without intermediate storage.

The example of Fig. 9(a), 9(b) implements the computation of assigning a constant value to the output port of the FSMD: outp <= ldc 42;. Thus, lines 5-14 declare the interface (entity) for the hardware block, assuming that outp is a 16-bit quantity. The FSMD requires three states. In line 17, a state type enumeration is defined consisting of types S_ENTRY, S_EXIT and S_1. Line 18 defines the signal 2-tuple for maintaining the state register, while in lines 19-20 the output register is defined. The current state logic (lines 25-34) performs asynchronous reset to all storage resources and assigns new contents to both the state and output registers. Next state and output logic (lines 37-57) decode current_state in order to determine the necessary actions for the computational states of the FSMD. State S_ENTRY is the idle state of the FSMD. When the FSMD is driven to this state, it is assumed ready to accept new input, thus the corresponding status output is raised. When a start prompt is given externally, the FSMD is activated and in the next cycle, state S_1 is reached. In S_1 the action of assigning CNST_42 to outp is performed. Finally, when state S_EXIT is reached, the FSMD declares the end of all computations via done and returns to its idle state.
It should be noted that this design approach is a rather conservative one. One possible optimization that can occur in certain cases is the merging of computational states that immediately precede the sink state (S_EXIT) with it.
Fig. 9(c) shows the timing diagram for the "minimal" design. As expected, the overall latency for computing a sample is three machine cycles.
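The behaviour described for the minimal FSMD can be mimicked by a cycle-based Python model. This is a behavioural sketch, not a translation of the Fig. 9 VHDL: the class name is hypothetical, and ready and done are derived directly from the registered state here, which is an assumption about the exact output timing.

```python
S_ENTRY, S_1, S_EXIT = "S_ENTRY", "S_1", "S_EXIT"
CNST_42 = 42


class MinimalFsmd:
    """Three-state FSMD computing outp <= ldc 42."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.state = S_ENTRY      # state register
        self.outp_reg = 0         # output register

    def clock(self, start: bool):
        """Run one machine cycle and return (ready, done, outp) as seen in it."""
        outp_next = self.outp_reg
        ready = done = False
        # Next-state and output logic: decode the current state.
        if self.state == S_ENTRY:
            ready = True
            next_state = S_1 if start else S_ENTRY
        elif self.state == S_1:
            outp_next = CNST_42
            next_state = S_EXIT
        else:                     # S_EXIT
            done = True
            next_state = S_ENTRY
        observed = (ready, done, self.outp_reg)
        # Current-state logic: registers update on the clock edge.
        self.state, self.outp_reg = next_state, outp_next
        return observed


if __name__ == "__main__":
    fsmd = MinimalFsmd()
    for cycle, start in enumerate([True, False, False, False], start=1):
        print(cycle, fsmd.clock(start))
```

In this model done is observed together with outp = 42 in the third cycle after start is sampled, matching the three-cycle latency stated above, and the block is ready again in the fourth.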
In certain cases, input registering might be desired. This intent can be made explicit by copying input port data to an internal register. For the case of the eda algorithm, a new localvar, a, would be introduced to perform the copy as a <= mov in1;. The VHDL counterpart is given as a_1_next <= in1;, making this data available through register a_1_reg in the following cycle. For register r, signal r_next represents the value that is available at the register input, and r_reg the stored data in the register.

Communication with embedded memories
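Array objects can be synthesized to block RAMs in contemporary FPGAs. These embedded memories support fully synchronous read and write operations (Xilinx, 2005). A requirement for asynchronous read mandates the use of memory residing in distributed LUT storage.
In BASIL, the load and store primitives are used for describing read and write memory access. We will assume a RAM memory model with write enable, and separate data input (din) and output (dout) sharing a common address port (rwaddr). To control access to such a block, a set of four non-trivial signals is needed: mem_we, a write enable signal, and the corresponding signals for addressing, data input and output. store is the simpler operation of the two. It requires raising mem_we in a given single-cycle state so that data are stored in memory and made available in the subsequent state/machine cycle.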

Fig. 10. Wait-state-based communication for loading data from a block RAM.

Synchronous load requires the introduction of a waitstate register. This register assists in devising a dual-cycle state for performing the load. Fig. 10 illustrates the implementation of a load operation. During the first cycle of STATE_1 the memory block is addressed. In the second cycle, the requested data are made available through mem_dout and are assigned to register mysignal. This data can be read from mysignal_reg during STATE_2.
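The dual-cycle load can be traced with a small Python model of a synchronous RAM and the waitstate register. It is a behavioural sketch only: SyncRam and load_with_waitstate are hypothetical names, the RAM is assumed to use read-first behaviour, and the Fig. 10 VHDL is not reproduced.

```python
from typing import List, Tuple


class SyncRam:
    """Block RAM model: dout changes only on a clock edge."""

    def __init__(self, contents: List[int]) -> None:
        self.mem = list(contents)
        self.dout = 0

    def clock(self, addr: int, we: bool = False, din: int = 0) -> None:
        self.dout = self.mem[addr]     # synchronous read (read-first assumed)
        if we:
            self.mem[addr] = din       # synchronous write when mem_we is raised


def load_with_waitstate(ram: SyncRam, addr: int) -> Tuple[int, List[Tuple[str, int]]]:
    """Perform one load; return the loaded value and a per-cycle trace."""
    trace = []
    state, waitstate, mysignal_reg = "STATE_1", 0, 0
    while state != "STATE_2":
        next_state, waitstate_next, mysignal_next = state, waitstate, mysignal_reg
        if waitstate == 0:
            waitstate_next = 1         # first cycle: the memory is addressed
        else:
            mysignal_next = ram.dout   # second cycle: mem_dout is valid
            waitstate_next = 0
            next_state = "STATE_2"
        trace.append((state, waitstate))
        ram.clock(addr)                # clock edge for the memory
        state, waitstate, mysignal_reg = next_state, waitstate_next, mysignal_next
    return mysignal_reg, trace


if __name__ == "__main__":
    value, trace = load_with_waitstate(SyncRam([10, 20, 30]), 2)
    print(value, trace)
```

The trace contains two entries for STATE_1, one per value of the waitstate register, which is the dual-cycle state the paragraph describes.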

Hierarchical FSMDs
Our extended FSMD concept allows for hierarchical FSMDs defining entire systems with calling and callee CDFGs. A two-state protocol can be used to describe a proper communication between such FSMDs. The first state is considered as the "preparation" state for the communication, while the latter state actually comprises an "evaluation" superstate where the entire computation applied by the callee FSMD is effectively hidden.
The calling FSMD performs computations where new values are assigned to ⋆_next signals and registered values are read from ⋆_reg signals. To avoid the problem of multiple signal drivers, callee procedure instances produce ⋆_eval data outputs that can then be connected to register inputs by hardwiring to the ⋆_next signal.
Fig. 11 illustrates a procedure call to an integer square root evaluation procedure. This procedure uses one input and one output std_logic_vector operand, both considered to represent integer values. Thus, a procedure call of the form (m) <= isqrt(x); is implemented by the given code segment in Fig. 11.
STATE_1 sets up the callee instance. The following state is a superstate where control is transferred to the component instance of the callee. When the callee instance terminates its computation, the ready signal is raised. Since the start signal of the callee is kept low, the generated output data can be transferred to the m register via its m_next input port. Control is then handed over to state STATE_3.
The callee instance follows the established FSMD interface, reading x_reg data and producing an exact integer square root in m_eval. Multiple copies of a given callee are supported by versioning of the component instances.

Fig. 12. Example of a functional pipeline in BASIL.

Streaming ports
ANSI C is the archetypical example of a general-purpose imperative language that does not support streaming primitives, i.e. it is not possible to express and process streams solely based on the semantics of such a language. Streaming (e.g. through queues) suits applications with a near-complete absence of control flow. One such example would be a functional pipeline of the form of Fig. 12, with A, B, C, D being compound types (arrays/vectors). Control flow in general applications is complex and it is not easy to intermix streamed and non-streamed inputs/outputs for each FSMD, either calling or callee.

Other issues
VHDL packages for implicit fixed-point arithmetic support
The latest approved IEEE 1076 standard (termed VHDL-2008) (IEEE, 2009) adds signed and unsigned (sfixed, ufixed) fixed-point data types and a set of primitives for their manipulation. The VHDL fixed-point package provides synthesizable implementations of fixed-point primitives for arithmetic, scaling and operand resizing (Ashenden & Lewis, 2008).

Design organization of an FSMD hardware IP
A proper FSMD hardware IP should seamlessly integrate into a hypothetical system. FSMD IPs would be viewed as black boxes adhering to certain principles such as registered outputs. Unconstrained vectors help in maintaining generic blocks without the need for explicit generics; this is an interesting idea, but it is not easily applicable when derived types are involved.
The outer product of two vectors A and B could be a theoretical case for a hardware block. The outer (or "cross") product is given by C = A × B or C = cross(A, B) for reading two matrices A, B to calculate C. Matrices A, B, C will have appropriate derived types that are declared in the cross_pkg.vhd package, a prerequisite for using the cross.vhd design file.
Regarding the block internals, the cross product of A, B is calculated and stored in a localvar array called Clocal. Clocal is then copied (possibly in parallel) to the C interface array with the help of a for-generate construct.
High-level optimizations relevant to hardware block development
Very important optimizations for increasing the efficiency of system-level communication are matrix flattening and argument globalization. The latter optimization is related to choices at the hardware interconnect level.
Matrix flattening deals with reducing the dimensions of an array from N to one. This optimization creates multiple benefits:
• addressing simplification
• direct mapping to physical memory (where addressing is naturally single-dimensional)
• interface and communication simplifications
Argument globalization is useful for replacing multiple copies of a given array by a single-access "globalvar" array. One important benefit is the prevention of exhausting interconnect resources. This optimization is feasible for single-threaded applications. For the example in Fig. 12 we assume that all changes can be applied sequentially on the B array, and that all original data are stored in A.
The aforementioned optimization would rapidly increase the number of "globalvar" arrays.
A "safe" but conservative approach would apply a restriction on "globalvar" access, allowing access to globals only by the root procedure of the call graph. This can be overcome by the development of a bus-based hardware interface for "globalvar" arrays, making globals accessible by any procedure.
[...] would assign to a single control step multiple operations that are associated through data dependencies. Operation chaining is popular for deriving custom instructions or superinstructions that can be added to processor cores as instruction-set extensions (Pozzi et al., 2006). Most techniques require a form of graph partitioning based on certain criteria such as the maximum acceptable path delay.
A hardware developer could resort to a simpler means for selective operation chaining by merging ASAP states to compound states. This optimization is only possible when a single definition site is used per variable (thus SSA form is mandatory). Then, an intermediate register is eliminated by assigning to a ⋆_next signal and reusing this value in the subsequent chained computation, instead of reading from the stored ⋆_reg value.

Hardware design of the 2D Euclidean distance approximation
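The eda algorithm shows good potential for speedup via operation chaining. Without this optimization, 7 cycles are required for computing the approximation, while chaining allows squeezing all computational states into one; thus three cycles are needed to complete the operation. Fig. 14 depicts VHDL code segments for an ASAP schedule with chaining disabled (Fig. 14(a)) and enabled (Fig. 14(b)). Figures 14(c) and 14(d) show cycle timings for the relevant I/O signals for both cases.
The prime factorization algorithm (pfactor) is a paramount example of the use of streaming outputs. Output outp is streaming and the data stemming from this port should be accessed based on the valid status. The reader can observe that outp is accessed periodically in the context of basic block BB3 as shown in Fig. 15(b). Fig. 16 shows the interface signals for factoring values 6 (a composite), 7 (a prime), and 8 (a composite which is also a power-of-2).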

Multi-function CORDIC
This example illustrates a universal CORDIC IP core supporting all directions (ROTATION, VECTORING) and modes (CIRCULAR, LINEAR, HYPERBOLIC) (Andraka, 1998; Volder, 1959). The input/output interface is similar to e.g. the CORDIC IP generated by Xilinx Core Generator (Xilinx, 2011a). It provides three data inputs ($x_{in}$, $y_{in}$, $z_{in}$) and three data outputs ($x_{out}$, $y_{out}$, $z_{out}$) as well as the direction and mode control inputs. The testbench will test the core for computing $\cos(x_{in})$, $\sin(y_{in})$, $\arctan(y_{in}/x_{in})$, [...] it can be used for anything computable by CORDIC iterations. The computation of $1/\sqrt{w}$ is performed in two stages: a) $y = 1/w$, b) $z = \sqrt{y}$. The design is a monolithic FSMD that does not include the post-processing needed, such as the scaling operation for the square root.
The FSMD for the CORDIC uses Q2.14 fixed-point arithmetic. While the ANSI C code requires 29 lines, the hand-coded BASIL representation uses 56 lines; the CDFG representation and the VHDL design use 178 and 436 lines, respectively, showing a clear tendency among the different abstraction levels used for design representation.
The core achieves 18 (CIRCULAR, LINEAR) and 19 cycles (HYPERBOLIC) per sample, or n + 4 and n + 5 cycles, respectively, where n is the fractional bitwidth. When the operation chaining optimization is not applied, 5 cycles per iteration are required instead of a single cycle where all operations are collapsed. A single-cycle per iteration constraint imposes the use of distributed LUT RAM, otherwise 3 cycles are required per sample.
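For orientation, one of the six direction/mode combinations (ROTATION direction, CIRCULAR mode) can be sketched in Python with Q2.14 integers. This is a textbook-style illustration, not the chapter's Fig. 17 code: the table construction, the names ATAN_TAB, K_INV and cordic_rotate, and the use of one iteration per fractional bit are assumptions.

```python
import math

FRAC_BITS = 14                  # fractional bitwidth n of the Q2.14 format
ONE = 1 << FRAC_BITS            # fixed-point representation of 1.0

# Elementary rotation angles atan(2**-i), stored as Q2.14 integers.
ATAN_TAB = [round(math.atan(2.0 ** -i) * ONE) for i in range(FRAC_BITS)]

# Cumulative CORDIC gain; pre-scaling x by its inverse yields unit-length results.
GAIN = 1.0
for i in range(FRAC_BITS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
K_INV = round(ONE / GAIN)


def cordic_rotate(z: int):
    """Rotate (1/gain, 0) by the Q2.14 angle z (radians); return (cos, sin) in Q2.14."""
    x, y = K_INV, 0
    for i in range(FRAC_BITS):
        if z >= 0:              # rotate counter-clockwise, angle residue shrinks
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TAB[i]
        else:                   # rotate clockwise
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TAB[i]
    return x, y


if __name__ == "__main__":
    c, s = cordic_rotate(round(0.5 * ONE))
    print(c / ONE, s / ONE, math.cos(0.5), math.sin(0.5))
```

Each loop iteration uses only shifts, adds and a table lookup, which is the body that operation chaining collapses into a single cycle; with n = 14 iterations the quoted n + 4 figure gives the 18 cycles per sample.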
Fig. 17(a) shows a C-like implementation of the multi-function CORDIC inspired by recent work (Arndt, 2010; Williamson, 2011). CNTAB is equivalent to fractional width n; HYPER, LIN and CIRC are shortened names for CORDIC modes and ROTN for the rotation direction; cordic_tab is the array of CORDIC coefficients and cordic_hyp_steps an auxiliary table handling repeated iterations for hyperbolic functions. cordic_tab is used to access coefficients for all modes with different offsets (0, 14 or 28 for our case).
Table 4 illustrates synthesis statistics for two CORDIC designs. The logic synthesis results with Xilinx ISE 12.3i reveal a 217 MHz (estimated) design when branching is entirely eliminated in the CORDIC loop, otherwise a faster design can be achieved (271.5 MHz). Both cycles and MHz could be improved by source optimization, loop unrolling for pipelining, and the use of embedded multipliers (pseudo-CORDIC) that would eliminate some of the branching needed in the CORDIC loop.

Conclusion
In this chapter, a straightforward FSMD-style model of computation was introduced that augments existing approaches. Our FSMD concept supports inter-FSMD communication, embedded memories, streaming outputs, and seamless integration of user IPs/black boxes.
To raise the level of design abstraction, the BASIL typed assembly language is introduced, which can be used for capturing the user's intent. We show that it is possible to convert this intermediate representation to self-contained CDFGs and finally to provide an easier path for designing a synthesizable VHDL implementation.
Throughout this chapter, representative examples were used to illustrate the key concepts of our approach, such as a prime factorization algorithm and an improved FSMD design of a multi-function CORDIC.

Fig. 5. CDFG construction algorithm accepting BASIL input.




Table 1. Data type specifications in BASIL.
domain specialization, investigate SSA (Static Single Assignment) construction algorithms and perform other compilation tasks.

Table 2. A set of basic operations for a BASIL-based IR.

List variables, List labels, Graph cfg;
output SymbolTable st, Graph cdfg;
begin
  Insert constant, input/output arguments and global variable operand nodes to st;
  Insert operation nodes;
  Insert incoming {global/constant/input, operation} and outgoing {operation, global/output} edges;
  Add control-dependence edges among operation nodes;
  Add data-dependence edges among operation nodes, extract loop-carried dependencies via cfg-reachability;
  Generate cdfg from st;
end

Table 3. Application profiling with a BASIL framework.

Table 4. Logic synthesis results for multi-function CORDIC.