High-Speed VLSI Architecture Based on Massively Parallel Processor Arrays for Real-Time Remote Sensing Applications

Developing computationally efficient processing techniques for massive volumes of hyperspectral data is critical for space-based Earth science and planetary exploration (see for example, (Plaza & Chang, 2008), (Henderson & Lewis, 1998) and the references therein). With the availability of remotely sensed data from different sensors of various platforms with a wide range of spatiotemporal, radiometric and spectral resolutions has made remote sensing as, perhaps, the best source of data for large scale applications and study. Applications of Remote Sensing (RS) in hydrological modelling, watershed mapping, energy and water flux estimation, fractional vegetation cover, impervious surface area mapping, urban modelling and drought predictions based on soil water index derived from remotelysensed data have been reported (Melesse et al., 2007). Also, many RS imaging applications require a response in (near) real time in areas such as target detection for military and homeland defence/security purposes, and risk prevention and response. Hyperspectral imaging is a new technique in remote sensing that generates images with hundreds of spectral bands, at different wavelength channels, for the same area on the surface of the Earth. Although in recent years several efforts have been directed toward the incorporation of parallel and distributed computing in hyperspectral image analysis, there are no standardized architectures or Very Large Scale Integration (VLSI) circuits for this purpose in remote sensing applications. Additionally, although the existing theory offers a manifold of statistical and descriptive regularization techniques for image enhancement/reconstruction, in many RS application areas there also remain some unsolved crucial theoretical and processing problems related to the computational cost due to the recently developed complex techniques (Melesse et al., 2007), (Shkvarko, 2010), (Yang et al., 2001). These descriptive-regularization techniques are associated with the unknown statistics of random perturbations of the signals in turbulent medium, imperfect array calibration, finite dimensionality of measurements, multiplicative signal-dependent speckle noise, uncontrolled antenna vibrations and random carrier trajectory deviations in the case of Synthetic Aperture Radar (SAR) systems (Henderson & Lewis, 1998), (Barrett & Myers, 2004). Furthermore, these techniques are not suitable for


Introduction
Developing computationally efficient processing techniques for massive volumes of hyperspectral data is critical for space-based Earth science and planetary exploration (see for example, (Plaza & Chang, 2008), (Henderson & Lewis, 1998) and the references therein). With the availability of remotely sensed data from different sensors of various platforms with a wide range of spatiotemporal, radiometric and spectral resolutions has made remote sensing as, perhaps, the best source of data for large scale applications and study. Applications of Remote Sensing (RS) in hydrological modelling, watershed mapping, energy and water flux estimation, fractional vegetation cover, impervious surface area mapping, urban modelling and drought predictions based on soil water index derived from remotelysensed data have been reported (Melesse et al., 2007). Also, many RS imaging applications require a response in (near) real time in areas such as target detection for military and homeland defence/security purposes, and risk prevention and response. Hyperspectral imaging is a new technique in remote sensing that generates images with hundreds of spectral bands, at different wavelength channels, for the same area on the surface of the Earth. Although in recent years several efforts have been directed toward the incorporation of parallel and distributed computing in hyperspectral image analysis, there are no standardized architectures or Very Large Scale Integration (VLSI) circuits for this purpose in remote sensing applications. Additionally, although the existing theory offers a manifold of statistical and descriptive regularization techniques for image enhancement/reconstruction, in many RS application areas there also remain some unsolved crucial theoretical and processing problems related to the computational cost due to the recently developed complex techniques (Melesse et al., 2007), (Shkvarko, 2010), (Yang et al., 2001). These descriptive-regularization techniques are associated with the unknown statistics of random perturbations of the signals in turbulent medium, imperfect array calibration, finite dimensionality of measurements, multiplicative signal-dependent speckle noise, uncontrolled antenna vibrations and random carrier trajectory deviations in the case of Synthetic Aperture Radar (SAR) systems (Henderson & Lewis, 1998), (Barrett & Myers, 2004). Furthermore, these techniques are not suitable for (near) real time implementation with existing Digital Signal Processors (DSP) or Personal Computers (PC). To treat such class of real time implementation, the use of specialized arrays of processors in VLSI architectures as coprocessors or stand alone chips in aggregation with Field Programmable Gate Array (FPGA) devices via the hardware/software (HW/SW) co-design, will become a real possibility for high-speed Signal Processing (SP) in order to achieve the expected data processing performance (Plaza, A. & Chang, 2008), (Castillo Atoche et al., 2010a, 2010b. Also, it is important to mention that cluster-based computing is the most widely used platform on ground stations, however several factors, like space, cost and power make them impractical for on-board processing. FPGA-based reconfigurable systems in aggregation with custom VLSI architectures are emerging as newer solutions which offer enormous computation potential in both cluster-based systems and embedded systems area. In this work, we address two particular contributions related to the substantial reduction of the computational load of the Descriptive-Regularized RS image reconstruction technique based on its implementation with massively processor arrays via the aggregation of highspeed low-power VLSI architectures with a FPGA platform. First, at the algorithmic-level, we address the design of a family of Descriptive-Regularization techniques over the range and azimuth coordinates in the uncertain RS environment, and provide the relevant computational recipes for their application to imaging array radars and fractional imaging SAR operating in different uncertain scenarios. Such descriptive-regularized family algorithms are computationally adapted for their HWlevel implementation in an efficient mode using parallel computing techniques in order to achieve the maximum possible parallelism. Second, at the systematic-level, the family of Descriptive-Regularization techniques based on reconstructive digital SP operations are conceptualized and employed with massively parallel processor arrays (MPPAs) in context of the real time SP requirements. Next, the array of processors of the selected reconstructive SP operations are efficiently optimized in fixed-point bit-level architectures for their implementation in a high-speed low-power VLSI architecture using 0.5um CMOS technology with low power standard cells libraries. The achieved VLSI accelerator is aggregated with a FPGA platform via HW/SW co-design paradigm. Alternatives propositions related to parallel computing, systolic arrays and HW/SW codesign techniques in order to achieve the near real time implementation of the regularizedbased procedures for the reconstruction of RS applications have been previously developed in (Plaza, A. & Chang, 2008), (Castillo Atoche et al., 2010a, 2010b. However, it should be noted that the design in hardware (HW) of a family of reconstructive signal processing operations have never been implemented in a high-speed low-power VLSI architecture based on massively parallel processor arrays in the past. Finally, it is reported and discussed the implementation and performance issues related to real time enhancement of large-scale real-world RS imagery indicative of the significantly increased processing efficiency gained with the proposed implementation of high-speed low-power VLSI architectures of the descriptive-regularized algorithms.

Remote sensing background
The general formalism of the RS imaging problem presented in this study is a brief presentation of the problem considered in (Shkvarko, 2006(Shkvarko, , 2008, hence some crucial model elements are repeated for convenience to the reader. The problem of enhanced remote sensing (RS) imaging is stated and treated as an illposed nonlinear inverse problem with model uncertainties. The challenge is to perform high-resolution reconstruction of the power spatial spectrum pattern (SSP) of the wavefield scattered from the extended remotely sensed scene via space-time processing of finite recordings of the RS data distorted in a stochastic uncertain measurement channel. The SSP is defined as a spatial distribution of the power (i.e. the second-order statistics) of the random wavefield backscattered from the remotely sensed scene observed through the integral transform operator (Henderson & Lewis, 1998), (Shkvarko, 2008). Such an operator is explicitly specified by the employed radar signal modulation and is traditionally referred to as the signal formation operator (SFO) (Shkvarko, 2006). The classical imaging with an array radar or SAR implies application of the method called "matched spatial filtering" to process the recorded data signals (Franceschetti et al., 2006), (Shkvarko, 2008), (Greco & Gini, 2007). A number of approaches had been proposed to design the constrained regularization techniques for improving the resolution in the SSP obtained by ways different from the matched spatial filtering, e.g., (Franceschetti et al., 2006), (Shkvarko, 2006(Shkvarko, , 2008, (Greco & Gini, 2007), (Plaza, A. & Chang, 2008), (Castillo Atoche et al., 2010a, 2010b but without aggregating the minimum risk descriptive estimation strategies and specialized hardware architectures via FPGA structures and VLSI components as accelerators units. In this study, we address a extended descriptive experiment design regularization (DEDR) approach to treat such uncertain SSP reconstruction problems that unifies the paradigms of minimum risk nonparametric spectral estimation, descriptive experiment design and worst-case statistical performance optimization-based regularization.

Problem statement
Consider a coherent RS experiment in a random medium and the narrowband assumption (Henderson & Lewis, 1998), (Shkvarko, 2006) that enables us to model the extended object backscattered field by imposing its time invariant complex scattering (backscattering) function e(x) in the scene domain (scattering surface) X  x. The measurement data wavefield u(y) = s(y) + n(y) consists of the echo signals s and additive noise n and is available for observations and recordings within the prescribed time-space observation domain Y = TP, where y = (t, p) T defines the time-space points in Y. The model of the observation wavefield u is defined by specifying the stochastic equation of observation (EO) of an operator form (Shkvarko, 2008):  (Henderson & Lewis, 1998), (Shkvarko, 2008)  , is referred to as the nominal SFO in the RS measurement channel specified by the time-space modulation of signals employed in a particular radar system/SAR (Henderson & Lewis, 1998), and the variation about the mean (,) S  yx = (y,x)S(y,x) models the stochastic perturbations of the wavefield at different propagation paths, where (y,x) is associated with zero-mean multiplicative noise (so-called Rytov perturbation model). All the fields , , enu in (2) are assumed to be zero-mean complex valued Gaussian random fields. Next, we adopt an incoherent model (Henderson & Lewis, 1998), (Shkvarko, 2006) of the backscattered field () e x that leads to the -form of its correlation function, R e (x 1 ,x 2 ) = b(x 1 )(x 1x 2 ). Here, e(x) and b(x) = <|e(x)| 2 > are referred to as the scene random complex scattering function and its average power scattering function or spatial spectrum pattern (SSP), respectively. The problem at hand is to derive an estimate ˆ( ) b x of the SSP () b x (referred to as the desired RS image) by processing the available finite dimensional array radar/SAR measurements of the data wavefield u(y) specified by (2).

Discrete-form uncertain problem model
The stochastic integral-form EO (2) to its finite-dimensional approximation (vector) form (Shkvarko, 2008) is now presented.
in which the perturbed SFO matrix represents the discrete-form approximation of the integral SFO defined for the uncertain operational scenario by the EO ( vector b at its principal diagonal), R n , and R u = <  e SR S  > p( Δ ) + R n , respectively, where <> p( Δ ) defines the averaging performed over the randomness of Δ characterized by the unknown probability density function p( Δ ), and superscript + stands for Hermitian conjugate. Following (Shkvarko, 2008), the distortion term Δ in (4) is considered as a random zero mean matrix with the bounded second-order moment   2 || || Δ . Vector b is composed of the elements, b k = () k e  = e k e k *  = |e k | 2 ; k = 1, …, K, and is referred to as a K-D vector-form approximation of the SSP, where  represents the second-order statistical ensemble averaging operator (Barrett & Myers, 2004). The SSP vector b is associated with the so-called lexicographically ordered image pixels (Barrett & Myers, 2004). The corresponding conventional K y K x rectangular frame ordered scene image B = {b(k x , k x ); k x , = 1,…,K x ; k v , = 1,…,K y } relates to its lexicographically ordered vector-form representation b = {b(k); k = 1,…,K = K y  K x } via the standard row by row concatenation (so-called lexicographical reordering) procedure, B = L{b} (Barrett & Myers, 2004). Note that in the simple case of certain operational scenario (Henderson & Lewis, 1998), (Shkvarko, 2008), the discrete-form (i.e. matrix-form) SFO S is assumed to be deterministic, i.e. the random perturbation term in (4) is irrelevant, Δ = 0. The digital enhanced RS imaging problem is formally stated as follows (Shkvarko, 2008): to map the scene pixel frame image B via lexicographical reordering B = L{b } of the SSP vector estimate b reconstructed from whatever available measurements of independent realizations of the recorded data vector u. The reconstructed SSP vector b is an estimate of the second-order statistics of the scattering vector e observed through the perturbed SFO (4) and contaminated with noise n; hence, the RS imaging problem at hand must be qualified and treated as a statistical nonlinear inverse problem with the uncertain operator. The high-resolution imaging implies solution of such an inverse problem in some optimal way. Recall that in this paper we intend to follow the unified descriptive experiment design regularized (DEDR) method proposed originally in (Shkvarko, 2008).

DEDR method 2.3.1 DEDR strategy for certain operational scenario
In the descriptive statistical formalism, the desired SSP vector b is recognized to be the vector of a principal diagonal of the estimate of the correlation matrix R e (b), i.e. b = {ˆe R } diag .
Thus one can seek to estimate b = {ˆe R } diag given the data correlation matrix R u pre- by determining the solution operator (SO) F such that where {·} diag defines the vector composed of the principal diagonal of the embraced matrix.
To optimize the search for F in the certain operational scenario the DEDR strategy was proposed in (Shkvarko, 2006) that implies the minimization of the weighted sum of the systematic and fluctuation errors in the desired estimate b where the selection (adjustment) of the regularization parameter  and the weight matrix A provide the additional experiment design degrees of freedom incorporating any descriptive properties of a solution if those are known a priori (Shkvarko, 2006). It is easy to recognize that the strategy (7) is a structural extension of the statistical minimum risk estimation strategy for the nonlinear spectral estimation problem at hand because in both cases the balance between the gained spatial resolution and the noise energy in the resulting estimate is to be optimized.

www.intechopen.com
From the presented above DEDR strategie, one can deduce that the solution to the optimization problem found in the previous study (Shkvarko, 2006) where represents the so-called regularized reconstruction operator; 1  n R is the noise whitening filter, and the adjoint (i.e. Hermitian transpose) SFO S + defines the matched spatial filter in the conventional signal processing terminology.

DEDR strategy for uncertain operational scenario
To optimize the search for the desired SO F in the uncertain operational scenario with the randomly perturbed SFO (4), the extended DEDR strategy was proposed in (Shkvarko, 2006) where the conditioning term (12) represents the worst-case statistical performance (WCSP) regularizing constraint imposed on the unknown second-order statistics <|| Δ || 2 > p( Δ ) of the random distortion component Δ of the SFO matrix (4), and the DEDR "extended risk" is defined by where the regularization parameter  and the metrics inducing weight matrix A compose the processing level "degrees of freedom" of the DEDR method.
To proceed with the derivation of the robust SFO (11), the risk function (13) was next decomposed and evaluated for its the maximum value applying the Cauchy-Schwarz inequality and Loewner ordering (Greco & F. Gini, 2007) of the weight matrix A   I with the scaled Loewner ordering factor  = min{   : A   I  } = 1. With these robustifications, the extended DEDR strategy (11) is transformed into the following optimization problem with the aggregated DEDR risk function The optimization solution of (14) follows a structural extension of (9) for the augmented (diagonal loaded)  R that yields

S S
www.intechopen.com represents the robustified reconstruction operator for the uncertain scenario.

DEDR imaging techniques
In this sub-section, three practically motivated DEDR-related imaging techniques (Shkvarko, 2008) are presented that will be used at the HW co-design stage, namely, the conventional matched spatial filtering (MSF) method, and two high-resolution reconstructive imaging techniques: (i) the robust spatial filtering (RSF), and (ii) the robust adaptive spatial filtering (RASF) methods. 1. MSF: The MSF algorithm is a member of the DEDR-related family specified for  >> ||S + S||, i.e. the case of a dominating priority of suppression of noise over the systematic error in the optimization problem (7). In this case, the SO (9) is approximated by the matched spatial filter (MSF): 2. RSF: The RSF method implies no preference to any prior model information (i.e., A = I) and balanced minimization of the systematic and noise error measures in (14) by adjusting the regularization parameter to the inverse of the signal-to-noise ratio (SNR), e.g.  = N 0 /B 0 , where B 0 is the prior average gray level of the image. In that case the SO F becomes the Tikhonov-type robust spatial filter in which the RSF regularization parameter  RSF is adjusted to a particular operational scenario model, namely,  RSF = (N 0 /b 0 ) for the case of a certain operational scenario, and  RSF = (N  /b 0 ) in the uncertain operational scenario case, respectively, where N 0 represents the white observation noise power density, b 0 is the average a priori SSP value, and N  = N 0 +  corresponds to the augmented noise power density in the correlation matrix specified by (16). 3. RASF: In the statistically optimal problem treatment,  and A are adjusted in an adaptive fashion following the minimum risk strategy, i.e.  A -1 = D = diag(b ), the diagonal matrix with the estimate b at its principal diagonal, in which case the SOs (9), (17) become itself solution-dependent operators that result in the following robust adaptive spatial filters (RASFs): for the certain operational scenario, and for the uncertain operational scenario, respectively. Using the defined above SOs, the DEDR-related data processing techniques in the conventional pixel-frame format can be unified now as follows www.intechopen.com with F (1) = F MSF ; F (2) = F RSF , and F (3) = F RASF , F (4) = F RASF , respectively. Any other feasible adjustments of the DEDR degrees of freedom (the regularization parameters , , and the weight matrix A) provide other possible DEDR-related SSP reconstruction techniques, that we do not consider in this study.

VLSI architecture based on Massively Parallel Processor Arrays
In this section, we present the design methodology for real time implementation of specialized arrays of processors in VLSI architectures based on massively parallel processor arrays (MPPAs) as coprocessors units that are integrated with a FPGA platform via the HW/SW co-design paradigm. This approach represents a real possibility for low-power high-speed reconstructive signal processing (SP) for the enhancement/reconstruction of RS imagery. In addition, the authors believe that FPGA-based reconfigurable systems in aggregation with custom VLSI architectures are emerging as newer solutions which offer enormous computation potential in RS systems. A brief perspective on the state-of-the-art of high-performance computing (HPC) techniques in the context of remote sensing problems is provided. The wide range of computer architectures (including homogeneous and heterogeneous clusters and groups of clusters, large-scale distributed platforms and grid computing environments, specialized architectures based on reconfigurable computing, and commodity graphic hardware) and data processing techniques exemplifies a subject area that has drawn at the cutting edge of science and technology. The utilization of parallel and distributed computing paradigms anticipates ground-breaking perspectives for the exploitation of high-dimensional data processing sets in many RS applications. Parallel computing architectures made up of homogeneous and heterogeneous commodity computing resources have gained popularity in the last few years due to the chance of building a high-performance system at a reasonable cost. The scalability, code reusability, and load balance achieved by the proposed implementation in such low-cost systems offer an unprecedented opportunity to explore methodologies in other fields (e.g. data mining) that previously looked to be too computationally intensive for practical applications due to the immense files common to remote sensing problems (Plaza & Chang, 2008).
To address the required near-real-time computational mode by many RS applications, we propose a high-speed low-power VLSI co-processor architecture based on MPPAs that is aggregated with a FPGA via the HW/SW co-design paradigm. Experimental results demonstrate that the hardware VLSI-FPGA platform of the presented DEDR algorithms makes appropriate use of resources in the FPGA and provides a response in near-real-time that is acceptable for newer RS applications.

Design flow
The all-software execution of the prescribed RS image formation and reconstructive signal processing (SP) operations in modern high-speed personal computers (PC) or any digital signal processors (DSP) platform may be intensively time consuming. These high computational complexities of the general-form DEDR-POCS algorithms make them definitely unacceptable for real time PC-aided implementation.
In this section, we describe a specific design flow of the proposed VLSI-FPGA architecture for the implementation of the DEDR method via the HW/SW co-design paradigm. The HW/SW co-design is a hybrid method aimed at increasing the flexibility of the implementation and improvement of the overall design process (Castillo Atoche et al., 2010a). When a co-processor-based solution is employed in the HW/SW co-design architecture, the computational time can be drastically reduced. Two opposite alternatives can be considered when exploring the HW/SW co-design of a complex SP system. One of them is the use of standard components whose functionality can be defined by means of programming. The other one is the implementation of this functionality via a microelectronic circuit specifically tailored for that application. It is well known that the first alternative (the software alternative) provides solutions that present a great flexibility in spite of high area requirements and long execution times, while the second one (the hardware alternative) optimizes the size aspects and the operation speed but limits the flexibility of the solution. Halfway between both, hardware/software co-design techniques try to obtain an appropriate trade-off between the advantages and drawbacks of these two approaches.
In (Castillo Atoche et al., 2010a), an initial version of the HW/SW-architecture was presented for implementing the digital processing of a large-scale RS imagery in the operational context. The architecture developed in (Castillo Atoche et al., 2010a) did not involve MPPAs and is considered here as a simply reference for the new pursued HW/SW co-design paradigm, where the corresponding blocks are to be designed to speed-up the digital SP operations of the DEDR-POCS-related algorithms developed at the previous SW stage of the overall HW/SW co-design to meet the real time imaging system requirements. The proposed co-design flow encompasses the following general stages: i. Algorithmic implementation (reference simulation in MATLAB and C++ platforms); ii. Partitioning process of the computational tasks; iii. Aggregation of parallel computing techniques; iv. Architecture design procedure of the addressed reconstructive SP computational tasks onto HW blocks (MPPAs);

Algorithmic implementation
In this sub-section, the procedures for computational implementation of the DEDR-related robust space filter (RSF) and robust adaptive space filter (RASF) algorithms in the MATLAB and C++ platforms are developed. This reference implementation scheme will be next compared with the proposed architecture based on the use of a VLSI-FPGA platform.

(ii) Partitioning process of the computational tasks
One of the challenging problems of the HW/SW co-design is to perform an efficient HW/SW partitioning of the computational tasks. The aim of the partitioning problem is to find which computational tasks can be implemented in an efficient hardware architecture looking for the best trade-offs among the different solutions. The solution to the problem requires, first, the definition of a partitioning model that meets all the specification requirements (i.e., functionality, goals and constraints). Note that from the formal SW-level co-design point of view, such DEDR techniques (20), (21), (22) can be considered as a properly ordered sequence of the vector-matrix multiplication procedure that one can next perform in an efficient high performance computational fashion following the proposed bit-level high-speed VLSI co-processor architecture. In particular, for implementing the fixed-point DEDR RSF and RASF algorithms, we consider in this partitioning stage to develop a high-speed VLSI co-processor for the computationally complex matrix-vector SP operation in aggregation with a powerful FPGA reconfigurable architecture via the HW/SW co-design technique. The rest of the reconstructive SP operations are employed in SW with a 32 bits embedded processor (MicroBlaze). This novel VLSI-FPGA platform represents a new paradigm for real time processing of newer RS applications. Fig. 1 illustrates the proposed VLSI-FPGA architecture for the implementation of the RSF/RASF algorithms. Once the partitioning stage has been defined, the selected reconstructive SP sub-task is to be mapped into the corresponding high-speed VLSI co-processor. In the HW design, the precision of 32 bits for performing all fixed-point operations is used, in particular, 9-bit integer and 23-bits decimal for the implementation of the co-processor. Such precision guarantees numerical computational errors less than 10 -5 referring to the MATLAB Fixed Point Toolbox (Matlab, 2011).

Aggregation of parallel computing techniques
This sub-section is focused in how to improve the performance of the complex RS algorithms with the aggregation of parallel computing and mapping techniques onto HWlevel massively parallel processor arrays (MPPAs). The basic algebraic matrix operation (i.e., the selected matrix-vector multiplication) that constitutes the base of the most computationally consuming applications in the reconstructive SP applications is transformed into the required parallel algorithmic representation format. A manifold of different approaches can be used to represent parallel algorithms, e.g. (Moldovan & Fortes, 1986), (Kung, 1988). In this study, we consider a number of different loop optimization techniques used in high performance computing (HPC) in order to exploit the maximum possible parallelism in the design: -Loop unrolling, -Nested loop optimization, -Loop interchange. In addition, to achieve such maximum possible parallelism in an algorithm, the so-called data dependencies in the computations must be analyzed (Moldovan & Fortes, 1986), (Kung, 1988). Formally, these dependencies are to be expressed via the corresponding dependence graph (DG). Following (Kung, 1988), we define the dependence graph G=[P, E] as a composite set where P represents the nodes and E represents the arcs or edges in which each e  E connects 12 , pp P that is represented as 12 ep p   . Next, the data dependencies analysis of the matrix-vector multiplication algorithms should be performed aimed at their efficient parallelization. , where y and ji a represents an n-dimensional (n-D) output vector and the corresponding element of A, respectively. The first SW-level transformation is the so-called single assignment algorithm (Kung, 1988), (Castillo Atoche et al., 2010b) that performs the computing of the matrix-vector product. Such single assignment algorithm corresponds to a loop unrolling method in which the primary benefit in loop unrolling is to perform more computations per iteration. Unrolling also reduces the overall number of branches significantly and gives the processor more instructions between branches (i.e., it increases the size of basic blocks). Next, we examine the computation-related optimizations followed by the memory optimizations. Typically, when we are working with nests of loops, we are working with multidimensional arrays. Computing in multidimensional arrays can lead to non-unit-stride memory access. Many of the optimizations can be perform on loop nests to improve the memory access patterns. The second SW-level transformation consists in to transform the matrix-vector single assignment algorithm in the locally recursive algorithm representation without global data dependencies (i.e. in term of a recursive form). At this stage, nestedloop optimizations are employed in order to avoid large routing resources that are translated into the large amount of buffers in the final processor array architecture. The variable being broadcasted in single assignment algorithms is removed by passing the variable through each of the neighbour processing elements (PEs) in a DG representation. Additionally, loop interchange techniques for rearranging a loop nest are also applied. For performance, the loop interchange of inner and outer loops is performed to pull the computations into the center loop, where the unrolling is implemented.

Architecture design onto MPPAs
Massively parallel co-processors are typically part of a heterogeneous hardware/softwaresystem. Each processor is a massive parallel system consisting of an array of PEs. In this study, we propose the MPPA architecture for the selected reconstructive SP matrix-vector operation. This architecture is first modelled in a processor Array (PA) and next, each processor is implemented also with an array of PEs (i.e., in a highly-pipelined bit-level representation). Thus, we achieved the pursued MPPAs architecture following the spacetime mapping procedures. First, some fundamental proved propositions are given in order to clarify the mapping procedure onto PAs. Proposition 1. There are types of algorithms that are expressed in terms of regular and localized DG. For example, basic algebraic matrix-form operations, discrete inertial transforms like convolution, correlation techniques, digital filtering, etc. that also can be represented in matrix formats (Moldovan & Fortes, 1986), (Kung, 1988). Proposition 2. As the DEDR algorithms can be considered as properly ordered sequences vector-matrix multiplication procedures, then, they can be performed in an efficient computational fashion following the PA-oriented HW/SW co-design paradigm (Kung, 1988). Following the presented above propositions, we are ready to derive the proper PA architectures. (Moldovan & Fortes, 1986) proved the mapping theory for the transformation where N represents the dimension of the DG (see proofs in (Kung, 1988) and details in (CastilloAtoche et al., 2010b). Second, the desired linear transformation matrix operator T can be segmented in two blocks as follows www.intechopen.com where Π is a (1×N)-D vector (composed of the first row of T ) which (in the segmenting terms) determines the time scheduling, and the (N -1)×N sub-matrix Σ in (24) is composed of the rest rows of T that determine the space processor specified by the so-called projection vector d (Kung, 1988).Next, such segmentation (24) yields the regular PA of (N-1)-D specified by the mapping where K is composed of the new revised vector schedule (represented by the first row of the PA) and the inter-processor communications (represented by the rest rows of the PA), and the matrix Φ specifies the data dependencies of the parallel representation algorithm. For a more detailed explanation of this theory, see (Kung, 1988), (CastilloAtoche et al., 2010b). In this study, the following specifications for the matrix-vector algorithm onto PAs . Now, for a simplified test-case, we specify the following operational parameters: m = n = 4, the period of clock of 10 ns and 32 bits data-word length. Now, we are ready to derive the specialized bit-level matrix-format MPPAs-based architecture. Each processor of the vector-matrix PA is next derived in an array of processing elements (PEs) at bit-level scale. Once again, the space-time transformation is employed to design the bit-level architecture of each processor unit of the matrix-vector PA.
The following specifications were considered for the bit-level multiply-accumulate architecture: for the vector schedule, for the projection vector and, for the space processor, respectively. With these specifications the transformation matrix becomes 12 01 The specified operational parameters are the following: l=32 (i.e., which represents the dimension of the word-length) and the period of clock of 10 ns. The developed architecture is next illustrated in Fig. 2. From the analysis of Fig. 2, one can deduce that with the MPPA approach, the real time implementation of computationally complex RS operations can be achieved due the highlypipelined MPPA structure.

Bit-level design based on MPPAS of the high-speed VLSI accelerator
As described above, the proposed partitioning of the VLSI-FPGA platform considers the design and fabrication of a low-power high-speed co-processor integrated circuit for the implementation of complex matrix-vector SP operation. Fig. 3 shows the Full Adder (FA) circuit that was constantly used through all the design. An extensive design analysis was carried out in bit-level matrix-format of the MPPAs-based architecture and the achieved hardware was studied comprehensively. In order to generate an efficient architecture for the application, various issues were taken into account. The main one considered was to reduce the gate count, because it determines the number of transistors (i.e., silicon area) to be used for the development of the VLSI accelerator. Power consumption is also determined by it to some extent. The design has also to be scalable to other technologies. The VLSI co-processor integrated circuit was designed using a Low-Power Standard Cell library in a 0.6µm double-poly triple-metal (DPTM) CMOS process using the Tanner Tools® software. Each logic cell from the library is designed at a transistor level. Additionally, S-Edit® was used for the schematic capture of the integrated circuit using a hierarchical approach and the layout was automatically done through the Standard Cell Place and Route (SPR) utility of L-Edit from Tanner Tools®.

Performance analysis 4.1 Metrics
In the evaluation of the proposed VLSI˗FPGA architectue, it is considered a conventional side-looking synthethic aperture radar (SAR) with the fractionally synthesized aperture as an RS imaging system (Shlvarko et al., 2008), (Wehner, 1994). The regular SFO of such SAR is factored along two axes in the image plane: the azimuth or cross-range coordinate (horizontal axis, x) and the slant range (vertical axis, y), respectively. The conventional triangular,  r (y), and Gaussian approximation,  a (x)=exp(-(x) 2 /a 2 ) with the adjustable fractional parameter a, are considered for the SAR range and azimuth ambiguity function (AF), (Wehner, 1994). In analogy to the image reconstruction, we employed the quality metric defined as an improvement in the output signal-to-noise ratio (IOSNR) IOSNR = 10 log 10 where k b represents the value of the kth element (pixel) of the original image B, () MSF k b represents the value of the kth element (pixel) of the degraded image formed applying the MSF technique (19), and () p k b represents a value of the kth pixel of the image reconstructed with two developed methods, p = 1, 2, where p = 1 corresponds to the RSF algorithm and p = 2 corresponds to the RASF algorithm, respectively. The quality metrics defined by (26) allows to quantify the performance of different image enhancement/reconstruction algorithms in a variety of aspects. According to these quality metrics, the higher is the IOSNR, the better is the improvement of the image enhancement/reconstruction with the particular employed algorithm.

RS implementation results
The reported RS implementation results are achieved with the VLSI-FPGA architecture based on MPPAs, for the enhancement/reconstruction of RS images acquired with different www.intechopen.com fractional SAR systems characterized by the PSF of a Gaussian "bell" shape in both directions of the 2-D scene (in particular, of 16 pixel width at 0.5 from its maximum for the 1K-by-1K BMP pixel-formatted scene). The images are stored and loaded from a compact flash device for the image enhancement process, i.e., particularly for the RSF and RASF techniques. The initial test scene is displayed in Fig. 4(a). Fig. 4(b) Fig.4 and Table 1, one may deduce that the RASF method over-performs the robust non-adaptive RSF in all simulated scenarios.

MPPA analysis
The matrix-vector multiplier chip and all of modules of the MPPA co-processor architecture were designed by gate-level description. As already mentioned, the chip was designed using a Standard Cell library in a 0.6µm CMOS process (Weste & D. Harris, 2004), (Rabaey et al., 2003). The resulting integrated circuit core has dimensions of 7.4 mm x 3.5 mm. The total gate count is about 32K using approximately 185K transistors. The 72-pin chip will be packaged in an 80 LD CQFP package and can operate both at 5 V and 3 V. The chip is illustrated in Fig. 5.   Fig. 4 and 5, one can deduce that the VLSI-FPGA platform based on MPPAs via the HW/SW co-design reveals a novel high-speed SP system for the real time enhacement/reconstruction of highly-computationally demanded RS systems. On one hand, the reconfigurable nature of FPGAs gives an increased flexibility to the design allowing an extra degree of freedoom in the partitioning stage of the pursued HW/SW co-design technique. On the other side, the use of VLSI co-processors introduces a low power, highspeed option for the implementation of computationally complex SP operations. The highlevel integration of modern ASIC technologies is a key factor in the design of bit-level MPPAs. Considering these factors, the VLSI/ASIC approach results in an attractive option for the fabrication of high-speed co-processors that perform complex operations that are constantly demanded by many applications, such as real-time RS, where the high-speed low-power computations exceeds the FPGAs capabilities.

Conclusions
The principal result of the reported study is the addressed VLSI-FPGA platform using MPPAs via the HW/SW co-design paradigm for the digital implementation of the RSF/RASF DEDR RS algorithms. First, we algorithmically adapted the RSF/RASF DEDR-related techniques over the range and azimuth coordinates of the uncertain RS environment for their application to imaging array radars and fractional imaging SAR. Such descriptive-regularized RSF/RASF algorithms were computationally transformed for their HW-level implementation in an efficient mode using parallel computing techniques in order to achieve the maximum possible parallelism in the design. Second, the RSF/RASF algorithms based on reconstructive digital SP operations were conceptualized and employed with MPPAs in context of the real time RS requirements. Next, the bit-level array of processors elements of the selected reconstructive SP operation was efficiently optimized in a high-speed VLSI architecture using 0.6um CMOS technology with low-power standard cells libraries. The achieved VLSI accelerator was aggregated with a reconfigurable FPGA device via HW/SW co-design paradigm. Finally, the authors consider that with the bit-level implementation of specialized arrays of processors in VLSI-FPGA platforms represents an emerging research field for the real-time RS data processing for newer Geospatial applications. In this book the reader will find a collection of chapters authored/co-authored by a large number of experts around the world, covering the broad field of digital signal processing. This book intends to provide highlights of the current research in the digital signal processing area, showing the recent advances in this field. This work is mainly destined to researchers in the digital signal processing and related areas but it is also accessible to anyone with a scientific background desiring to have an up-to-date overview of this domain. Each chapter is self-contained and can be read independently of the others. These nineteenth chapters present methodological advances and recent applications of digital signal processing in various domains as communications, filtering, medicine, astronomy, and image processing.