
A Scalable, FPGA-Based Implementation of the Unscented Kalman Filter

By Jeremy Soh and Xiaofeng Wu

Submitted: March 8th 2018. Reviewed: August 2nd 2018. Published: November 5th 2018.

DOI: 10.5772/intechopen.80681


Abstract

Autonomous aerospace systems may soon become ubiquitous, provided their autonomous capability continues to increase. Greater autonomous capability means there is a need for high-performance state estimation. However, the desire to reduce costs through simplified development processes and compact form factors can limit performance. A hardware-based approach, such as using a field-programmable gate array (FPGA), is common when high performance is required, but hardware approaches tend to have a more complicated development process than traditional software approaches; greater development complexity, in turn, results in higher costs. Leveraging the advantages of both hardware-based and software-based approaches, a hardware/software (HW/SW) codesign of the unscented Kalman filter (UKF), based on an FPGA, is presented. The UKF is split into an application-specific part, implemented in software to simplify the development process, and a non-application-specific part, implemented in hardware as a parameterisable ‘black box’ module (i.e. IP core) to increase performance. Simulation results demonstrating a possible nanosatellite application of the design are presented, along with implementation (synthesis, timing, power) details.

Keywords

  • field-programmable gate array (FPGA)
  • unscented Kalman filter (UKF)
  • codesign
  • system on a chip (SoC)
  • nonlinear state estimation

1. Introduction

Small (micro-, nano-, pico-) satellites and (micro-) unmanned aerial systems (UASs) are emerging technologies that have the potential to be of great academic and commercial use but only if a balance can be found between two diametrically opposed forces that act on their design: the desire, and need, for high performance and the desire to reduce costs. High performance, especially in state estimation, is necessary for these technologies to be advantageous over traditional aerospace systems in relevant applications.

The desire to reduce the costs of these technologies has led to their miniaturisation and heavy use of commercial off-the-shelf (COTS) components so that some level of economy of scale may be achieved. Though both component and development costs can be reduced in this way, this approach, in turn, leads to a reduction in the resources available (e.g. electrical power, computing power and physical space) aboard those systems, impacting performance.

Specialised hardware, e.g. application-specific integrated circuit (ASIC)- or field-programmable gate array (FPGA)-based systems, can achieve high performance, even for severely resource-constrained systems, but tends to increase the development complexity of these systems; in this way, using specialised hardware may reduce component costs and meet performance and miniaturisation requirements, while development costs are typically increased. This issue is illustrated in Figure 1, which depicts the balance between development complexity and performance for different embedded systems; greater complexity during the development process means that a greater investment in resources, personnel and time becomes necessary, which leads to higher development costs.

Figure 1.

The performance versus development complexity trade-off for different types of embedded systems.

Software approaches, e.g. microprocessor-based systems, generally have lower performance than specialised hardware but have much simpler, and thus cheaper, development processes. It is, however, possible to draw upon aspects of both hardware and software approaches and combine them into a hardware/software codesign. This codesign could deliver the high performance of specialised hardware but, by using software techniques, e.g. modularity or abstraction, could also alleviate some of the high development costs associated with such hardware. If this codesign approach is applied to a prolific state estimation algorithm, then the performance and miniaturisation requirements could be met, while keeping development costs low.

In this chapter, a library containing a scalable, hardware/software (HW/SW) codesign of the unscented Kalman filter (UKF), based on an FPGA, is presented. The codesign is implemented as a fully parameterisable, self-contained ‘black box’ (which is often referred to as an IP core), which aims to minimise the necessary input from system designers when applying the codesign to a new application, such that overall development complexity is reduced.

This chapter is a distillation of work first presented in [1]. The rest of this chapter is organised as follows: Section 2 provides background on relevant concepts, Section 3 outlines the proposed design, Section 4 describes the simulation model to verify the design and simulation results, Section 5 presents implementation results and Section 6 concludes the chapter.

2. Background

2.1 Hardware/software codesign

Hardware/software codesign is a design practice that is often used with system-on-a-chip (SoC) architectures. The term system on a chip comes from the field of very large-scale integration (VLSI), where individual hardware units or ‘black boxes’ (IP cores) that perform some dedicated function are arranged and connected together on a single ASIC chip. Typical SoCs may include a microcontroller or microprocessor core, DSPs, memories such as RAMs or ROMs, peripherals such as timers/counters, communication interfaces such as USB/UART, analog interfaces such as ADCs/DACs or other analog, digital, mixed-signal or RF hardware. Previously, each of these components may have had its own ASIC, with the ASICs connected together on a PCB; however, in accordance with Moore’s law, the resource densities of silicon chips have increased massively over time, so these components can now be integrated together on a single chip. An example SoC can be seen in Figure 2a.

Figure 2.

Examples of system-on-a-chip architectures.

(a) Example of a typical system on a chip.

(b) An example of a target architecture for early hardware/software codesign implementations.

SoC designs for FPGAs have become more popular recently as the increase in resource densities allowed more complex logic to be implemented. The push towards SoCs on FPGAs is driven by the desire for greater autonomy in a variety of systems; the most obvious example is field robotics, but autonomous systems such as ‘smart’ rooms and satellites also have a need for a small form factor and high-performance computing solutions that the FPGA is well placed to deliver.

As VLSI technology matured, designers began to see that the increase in development complexity for hardware or ASIC designs was impacting their ability to bring products to market quickly. The associated increase in the complexity of microprocessors led many designers to realise that these microprocessor units could be included into system designs and some of the functionality shifted to software to reduce their time to market. Microprocessors can be considered a SoC on their own, but they can also be included in much larger SoC designs, and this is where the idea of hardware/software codesign first began; reviews of the field by [2, 3, 4] give a comprehensive history of hardware/software codesign. A basic example of an architecture where hardware/software codesign may be appropriate is shown in Figure 2b.

2.2 Unscented Kalman filter

The extended Kalman filter (EKF) is, and has been, the most widespread method for nonlinear state estimation [5]. It has also become the de facto standard by which other methods are compared when analysing their performance. Various surveys of the field have noted that the EKF is ‘unquestionably dominant’ [6], ‘the workhorse’ of state estimation [7, 8] and the ‘most common’ nonlinear filter [9]. Despite some shortcomings, the relative ease of implementation and still-remarkable accuracy have propelled the EKF’s popularity.

However, more recently the unscented Kalman filter (UKF) [10] has been shown to perform much better than the EKF when the system models are ‘highly’ nonlinear [6, 7, 8, 9, 11]. While the EKF attempts to deal with nonlinearities by using the Jacobian to linearise the system model, the UKF instead models the current state as a probability distribution with some mean and covariance. A deterministic sampling technique known as the unscented transform (UT) is then applied: a set of points, called ‘sigma’ points, is drawn from the probability distribution, and each point is propagated through the nonlinear system models. The new mean and covariance of the transformed sigma points are then recovered to inform the new state estimate. The crucial aspect of the UT is that the sigma points are drawn deterministically, unlike in random sampling methods such as Monte Carlo algorithms, drastically reducing the number of points necessary to recover the ‘transformed’ mean and covariance.

Consider the general nonlinear system described for discrete time, k:

\[ x_k = f\left(x_{k-1}, u_{k-1}, w_{k-1}\right) \tag{1} \]
\[ z_k = h\left(x_k, v_k\right) \tag{2} \]

where f and h are the system’s process and observation models, respectively; x and z are the state and observation vectors, respectively; u is the control input; and w and v are, respectively, the process/control and measurement/observation noise, which are assumed to be zero-mean Gaussian white noise terms with covariances Q and R. The formalisation of the UKF for this system is as follows. Define an augmented state vector, x^a, with length M, that concatenates the process/control noise and measurement noise terms with the state variables as

\[ x_k^a = \begin{bmatrix} x_k \\ w_k \\ v_k \end{bmatrix} \tag{3} \]

The augmented state vector has an associated augmented state covariance, P_k^a, which combines the (regular) state covariance, P_k, with the noise covariances Q_k and R_k. The augmented state vector and covariance are initialised with

\[ \hat{x}_0^a = E\left[x_0^a\right] = \begin{bmatrix} \hat{x}_0 \\ 0 \\ 0 \end{bmatrix}, \qquad P_0^a = E\left[\left(x_0^a - \hat{x}_0^a\right)\left(x_0^a - \hat{x}_0^a\right)^T\right] = \begin{bmatrix} P_0 & 0 & 0 \\ 0 & Q_0 & 0 \\ 0 & 0 & R_0 \end{bmatrix} \tag{4} \]

where x̂_0 is the expected value of the initial (regular) state. There exist various other sigma point selection strategies, and, in order to minimise computational effort, a selection strategy involving a minimal set of samples is highly desired. The spherical simplex set of points [10, 12] can be shown to offer similar performance to the original UKF with the smallest number of sigma points required (M + 2). The sigma point weights and a coefficient matrix are generated by choosing 0 ≤ W_0 ≤ 1 and then calculating W_1:

\[ W_1 = W_i = \frac{1 - W_0}{M + 1}, \qquad i = 1, \ldots, M+1 \tag{5} \]

The choice of W_0 determines the spread of sigma points about the mean. Choosing W_0 → 1 reduces the spread, implying a greater confidence in the previous estimate, while the opposite is true when choosing W_0 → 0. The vector sequence is initialised as

\[ \sigma_0^1 = 0, \qquad \sigma_1^1 = -\frac{1}{\sqrt{2 W_1}}, \qquad \sigma_2^1 = \frac{1}{\sqrt{2 W_1}} \tag{6} \]

Then, the vector sequence is expanded for j = 2, …, M via

\[
\sigma_i^j =
\begin{cases}
\begin{bmatrix} \sigma_0^{j-1} \\ 0 \end{bmatrix} & i = 0 \\[6pt]
\begin{bmatrix} \sigma_i^{j-1} \\ -\dfrac{1}{\sqrt{j(j+1)W_1}} \end{bmatrix} & i = 1, \ldots, j \\[6pt]
\begin{bmatrix} 0_{j-1} \\ \dfrac{j}{\sqrt{j(j+1)W_1}} \end{bmatrix} & i = j+1
\end{cases}
\tag{7}
\]
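To make the recursion concrete, the following is a minimal C sketch of a coefficient-matrix generator implementing Eqs. (5)-(7); the fixed size M_AUG = 20 matches the example application of Section 4, but the function and macro names are illustrative only:

```c
#include <math.h>

#define M_AUG 20                 /* augmented state length M (example)    */
#define N_SIG (M_AUG + 2)        /* spherical simplex point count, M + 2  */

/* Build the spherical simplex coefficient matrix of Eqs. (5)-(7).
 * Column i of sigma holds the coefficient vector sigma_i; W0 is in [0, 1].
 */
void simplex_coeffs(double W0, double sigma[M_AUG][N_SIG])
{
    double W1 = (1.0 - W0) / (M_AUG + 1);           /* Eq. (5)            */

    for (int r = 0; r < M_AUG; r++)                 /* zero-initialise    */
        for (int i = 0; i < N_SIG; i++)
            sigma[r][i] = 0.0;

    /* Eq. (6): the j = 1 seed sequence lives in row 0. */
    sigma[0][1] = -1.0 / sqrt(2.0 * W1);
    sigma[0][2] =  1.0 / sqrt(2.0 * W1);

    /* Eq. (7): expand the sequence one dimension at a time, j = 2..M.    */
    for (int j = 2; j <= M_AUG; j++) {
        double a = 1.0 / sqrt(j * (j + 1.0) * W1);
        for (int i = 1; i <= j; i++)
            sigma[j - 1][i] = -a;                   /* extend old points  */
        sigma[j - 1][j + 1] = j * a;                /* append a new point */
    }
}
```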

The actual sigma points are drawn, via a column-wise accumulation, from

\[ \chi_{i,k} = \hat{x}_k^a + \left(\sqrt{P_k^a}\,\sigma\right)_i \tag{8} \]

where i refers to the ith column of the matrix product and √(P_k^a) refers to the matrix ‘square root’. The matrix square root of a target matrix A is a matrix B that satisfies A = BB^T; it is often calculated via the Cholesky decomposition [13].
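Given the coefficient matrix and the Cholesky factor of the augmented covariance, the column-wise draw of Eq. (8) then reduces to a matrix multiply-add, sketched below in C (row-major storage is assumed; the MAT helper and the function name are illustrative, not the IP core's interface):

```c
#define MAT(A, cols, i, j) ((A)[(i) * (cols) + (j)])   /* row-major access */

/* Draw sigma points via Eq. (8): chi_i = x_a + sqrt(P_a) * sigma_i.
 * M: augmented state length; N: number of sigma points (M + 2);
 * x_a: augmented state (length M); sqrtP: lower-triangular Cholesky
 * factor of P_a (M x M); sigma: coefficient matrix (M x N); chi: M x N.
 */
void draw_sigma_points(int M, int N, const float *x_a, const float *sqrtP,
                       const float *sigma, float *chi)
{
    for (int i = 0; i < N; i++)            /* one column per sigma point  */
        for (int r = 0; r < M; r++) {
            float acc = x_a[r];            /* column-wise add of the mean */
            for (int c = 0; c <= r; c++)   /* sqrtP is lower triangular   */
                acc += MAT(sqrtP, M, r, c) * MAT(sigma, N, c, i);
            MAT(chi, N, r, i) = acc;
        }
}
```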

The predict step begins with the sigma points being propagated through the system model:

\[ \chi_{i,k|k-1}^x = f\left(\chi_{i,k-1|k-1}^x,\; u_{k-1|k-1},\; \chi_{i,k-1|k-1}^w\right) \tag{9} \]

The state and covariance are then predicted as

\[ \hat{x}_k^- = \sum_{i=0}^{N-1} W_i^m \chi_{i,k|k-1}^x \tag{10} \]
\[ P_k^- = \sum_{i=0}^{N-1} W_i^c \left(\chi_{i,k|k-1}^x - \hat{x}_k^-\right)\left(\chi_{i,k|k-1}^x - \hat{x}_k^-\right)^T \tag{11} \]

For the update step, the sigma points that were updated in the predict step are propagated through the observation model:

\[ Z_{i,k|k-1} = h\left(\chi_{i,k|k-1}^x,\; \chi_{i,k-1|k-1}^v\right) \tag{12} \]

The mean and covariance of the observation-transformed sigma points are calculated:

\[ \hat{z}_{k|k-1} = \sum_{i=0}^{N-1} W_i^m Z_{i,k|k-1} \tag{13} \]
\[ S_{k|k-1} = \sum_{i=0}^{N-1} W_i^c \left(Z_{i,k|k-1} - \hat{z}_{k|k-1}\right)\left(Z_{i,k|k-1} - \hat{z}_{k|k-1}\right)^T \tag{14} \]

followed by the cross-covariance

\[ P_{xz,k|k-1} = \sum_{i=0}^{N-1} W_i^c \left(\chi_{i,k|k-1}^x - \hat{x}_{k|k-1}\right)\left(Z_{i,k|k-1} - \hat{z}_{k|k-1}\right)^T \tag{15} \]

and the Kalman gain

\[ K = P_{xz,k|k-1}\, S_{k|k-1}^{-1} \tag{16} \]

Finally, the current system state is estimated by

\[ \hat{x}_k = \hat{x}_{k|k-1} + K\left(\tilde{z}_k - \hat{z}_{k|k-1}\right) \tag{17} \]

where z̃ is the current set of observations, and the current covariance is updated with

\[ P_k = P_{k|k-1} - K S_{k|k-1} K^T \tag{18} \]
\[ P_k = P_{k|k-1} - P_{xz,k|k-1} K^T \tag{19} \]

where the expression for the Kalman gain, Eq. (16), is substituted.
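Substituting Eq. (16) into Eq. (18) makes the equivalence of the two forms explicit:

\[ K S_{k|k-1} K^T = \left(P_{xz,k|k-1}\, S_{k|k-1}^{-1}\right) S_{k|k-1} K^T = P_{xz,k|k-1} K^T \]

The form of Eq. (19) saves one matrix multiplication per iteration.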

3. Hardware/software codesign of the unscented Kalman filter

The first exercise in the hardware/software codesign is to divide the UKF algorithm into two parts. For maximum performance, it is desirable for as much of the algorithm as possible to be implemented in hardware. However, to maintain portability, any part of the algorithm that is application-specific would be better implemented in software. This is so that the application-specific parts can make use of the faster and simpler development processes that using software entails. Reviewing the UKF algorithm, only the two system models, the predict and update models, are application-specific.

Apart from the two system models, the rest of the UKF can be viewed as, essentially, a series of matrix manipulations. The only changes to the rest of the UKF when either of the system models changes are the size of the vectors and matrices used in the UKF calculations. The sizes of these vectors and matrices are fixed for a particular formulation of the UKF, and so they can be treated as parameters that are set at synthesis. Fixing the parameters at synthesis means that only the bare minimum of hardware resources is needed, but the hardware can still be easily used for different applications with different vector/matrix sizes; rather than needing to redesign any functionality, the hardware can simply be synthesised with different parameters. Thus, the rest of the UKF can be designed and then used as a parameterisable, modular ‘black box’ (IP core), and implementing it for any given application only requires the appropriate selection of parameters.

When it comes to designing hardware, there are three main considerations: the performance of the hardware (which may include data throughput, maximum clock frequency, etc.), the logic area (on-chip resources) used and the power consumption of the hardware. During development these considerations are usually at odds with each other; specifically, performance is usually opposed by logic area and power consumption. In order to increase the performance of the design, additional resources often have to be used, which, in turn, may increase the power consumption; these increases in resource/power cost may make the implementation infeasible for a given application. Due to these considerations, it is beneficial to give system designers the ability to scale resource use to the requirements of their particular application.

The actual physical implementation of the hardware/software UKF on an FPGA can be seen in Figure 3. The hardware part is implemented as a stand-alone IP core, and the software part is implemented on a general-purpose microprocessor. The processor acts as the main controller which, in addition to implementing the system model software, controls the hardware IP core. The precise method of controlling the IP core is dependent on the design variation and is elaborated on in the following sections.

Figure 3.

The hardware/software partition on the FPGA.

The processor communicates with the IP core over some communication interface. Any intra-chip communication method would be sufficient and would be driven mostly by the requirements of the application; viable interfaces include point-to-point, bus or NoC interfaces. The IP core contains memory buffers at the interface in order to receive data from the processor as well as to temporarily store data that needs to be read by the processor. The communication interface is the same between all three variants, but the specifics of the memory buffers are not.

Here, the communication interface between the two parts is an AXI4 bus. All variants are implemented using single-precision arithmetic (IEEE-754); this gives a decent balance of dynamic range and resource usage which should be sufficient for the majority of applications. All hardware in the codesign is developed using the Verilog HDL, and all software in the codesign is developed using C. Although C is used here, in general, any type of software may be used as long as it contains the ability to interact with the communication interface connecting the hardware and software parts.

3.1 Overall design

The codesign utilises the main benefit of hardware implementations: wide parallelism. An increase in performance is gained by encapsulating certain parts of the major datapaths into a sub-module called a processing element (PE) and then using multiple instances of these PEs in parallel, allowing multiple elements of an algorithm to be calculated at once. The increase in resources used comes not only from the extra processing elements but also from the additional overhead needed to manage the parallel memory structure that feeds them. The number of PEs used in the design is parameterisable, allowing for some trade-offs by the system designer between resources used and performance.

The codesign logically separates the UKF algorithm into three parts, while on the hardware side, the IP core consists of the UKF hardware and a memory buffer which is attached to a communication (AXI4) bus; the top-level block diagram of the codesign can be seen in Figure 4. The memory buffer has a memory map that ensures that data are coherent between the processor and the IP core and also incorporates a control register which allows the IP core to be controlled. The control register allows the processor to reset or enable the IP core as a whole as well as start one of the core’s functional steps via a state machine; the control register also records the current state of the IP core. Data required by the IP core (e.g. transformed sigma points) must be placed in the memory buffer at the appropriate address by the processor before signalling the IP core to begin its calculations. The control register may be polled by the processor to control the IP core; alternatively, the core may also be configured with an optional interrupt line that may be attached to the processor’s interrupt controller or external interrupt lines. An example of driving the core this way is sketched below.
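The sketch below illustrates the polling variant of this control flow from the software side. It is a hedged example only: the base address, register offsets and bit assignments are invented for illustration, and the real memory map is fixed by the core's synthesis parameters.

```c
#include <stdint.h>

/* Hypothetical memory map of the IP core's buffer (example values only). */
#define UKF_BASE      0x43C00000u             /* AXI4 base address        */
#define UKF_CTRL      (*(volatile uint32_t *)(UKF_BASE + 0x00))
#define UKF_CTRL_EN   (1u << 0)               /* enable the core          */
#define UKF_CTRL_SIG  (1u << 1)               /* start the sig_gen step   */
#define UKF_CTRL_DONE (1u << 31)              /* step-complete status bit */
#define UKF_DATA      ((volatile float *)(UKF_BASE + 0x100))

/* Start one sig_gen step and poll the control register until it is done. */
static void ukf_sig_gen(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)               /* place input data in the  */
        UKF_DATA[i] = in[i];                  /* memory buffer            */

    UKF_CTRL = UKF_CTRL_EN | UKF_CTRL_SIG;    /* signal the core to start */

    while (!(UKF_CTRL & UKF_CTRL_DONE))       /* poll for completion      */
        ;                                     /* (or use the interrupt)   */

    for (int i = 0; i < n; i++)               /* read back the results    */
        out[i] = UKF_DATA[i];
}
```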

Figure 4.

Top-level block diagram.

3.2 Sigma point generation

The sig_gen step uses the current augmented state vector and covariance to calculate the new sigma points via Eq. (8). To calculate the new set of sigma points, the matrix ‘square root’ of the current augmented covariance must first be calculated; this is implemented by the trisolve module. The ‘square root’ of the augmented covariance is then multiplied by the sigma coefficient weighting matrix, and the current augmented state vector is added column-wise; this is implemented by the matrix multiply-add module described in Section 3.2.2. After the sigma points are calculated, they are written to the memory buffer, a control bit is set to signify completion to the processor and, if the interrupt line is included, an interrupt is generated. A block diagram of this step can be seen in Figure 5 showing the data flow between modules.

Figure 5.

Block diagram of the sig_gen step.

The memory prefetch and memory serialiser modules add a small amount of overhead to the sig_gen step but are necessary because the matrix multiply-add features a parallelised datapath while the memory buffer requires serial access.

3.2.1 Triangular linear equation solver

In addition to the matrix ‘square root’, the Cholesky decomposition is also used in the Kalman gain calculation, which involves a matrix inversion (see Eq. (16)). Directly computing a matrix inversion is extremely computationally demanding, so rather than inverting the matrix directly, an algorithm called the matrix ‘right divide’ is used here. For positive definite matrices, this algorithm involves using the Cholesky decomposition to decompose the target matrix into a triangular form, followed by forward elimination and then back substitution to solve the system; this sequence may be treated as solving a series of triangular linear equations, meaning the same hardware can be reused for each operation [14]. The Cholesky decomposition of a positive definite target matrix A is given by

\[ A = L_1 L_1^T \tag{20} \]

where L_1 is lower triangular. Reducing the calculation to a series of triangular linear equations involves using an alternative version:

\[ A = L_2 D L_2^T \tag{21} \]

where L_2 is lower triangular and its diagonal terms are unit elements, D is diagonal and the two versions are related by

\[ L_1 = L_2 \sqrt{D} \tag{22} \]

The recombination process is necessary because of the subsequent matrix multiply-add between the augmented covariance square root and the sigma weighting coefficient matrix (see Eq. (8)). The element-wise calculation for L_2 and D is given by

\[ F_{ij} = A_{ij} - \sum_{k=1}^{j-1} L_{ik} F_{jk}, \qquad i > j \tag{23} \]
\[ D_j = A_{jj} - \sum_{k=1}^{j-1} \frac{F_{jk}^2}{D_k} \tag{24} \]

where F_{ij} = L_{ij} D_j and F_{jk} = L_{jk} D_k are substituted to simplify the calculation further. Figure 6 depicts the full trisolve datapath, including the division and the recombination process to recover L_1. The input b is either A_{ij} in the Cholesky decomposition or an element from the divisor matrix in the matrix ‘right divide’.
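In software terms, the recurrences of Eqs. (23) and (24) followed by the recombination of Eq. (22) amount to the routine below. This is a C sketch of the arithmetic the trisolve datapath performs, not of its hardware schedule; a fixed size bound is assumed only to keep the example self-contained:

```c
#include <math.h>

#define N_MAX 64                        /* size bound for the sketch      */

/* Factor a positive definite A (n x n, row-major) via Eqs. (23)-(24),
 * computing F (F_ij = L_ij * D_j) and the diagonal D, then recombine
 * into the conventional Cholesky factor L1 = L2 * sqrt(D) of Eq. (22).
 */
void cholesky_trisolve(int n, const double *A, double *L1)
{
    static double F[N_MAX][N_MAX], D[N_MAX];

    for (int j = 0; j < n; j++) {
        double d = A[j * n + j];                 /* Eq. (24)              */
        for (int k = 0; k < j; k++)
            d -= F[j][k] * F[j][k] / D[k];
        D[j] = d;

        for (int i = j + 1; i < n; i++) {        /* Eq. (23), i > j       */
            double f = A[i * n + j];
            for (int k = 0; k < j; k++)
                f -= (F[i][k] / D[k]) * F[j][k]; /* L_ik = F_ik / D_k     */
            F[i][j] = f;
        }
    }

    for (int i = 0; i < n; i++)                  /* Eq. (22): recombine   */
        for (int j = 0; j < n; j++)              /* L1_ij = F_ij/sqrt(Dj) */
            L1[i * n + j] = (j > i)  ? 0.0
                          : (j == i) ? sqrt(D[i])
                                     : F[i][j] / sqrt(D[j]);
}
```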

Figure 6.

Triangular linear equation solver.

The fused multiply-add module and feedback FIFO have been encapsulated to form an elementary block of hardware called a processing element (PE), which can be instantiated multiple times in parallel. The PEs output to a demultiplexer, which ensures values are passed to the subsequent calculations in the correct order. The calculations after the demultiplexer are not parallelised because they require far fewer operations, so parallelisation is unnecessary there.

3.2.2 Matrix multiply-add

The matrix multiply-add datapath is a standard ‘naive’ element-wise multiplication and accumulation. However, the hardware to calculate one element is enclosed as one processing element, and additional PEs are added to handle calculations in parallel; the matrix multiply-add datapath can be seen in Figure 7. Each PE is responsible for calculating at least one row of the result matrix. The elements of the matrix to be added can simply be injected into the accumulation directly, instead of performing an additional matrix addition after a matrix multiplication.
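Functionally, the module computes Y = AB + C with the addend folded into the accumulation. The C sketch below shows the arithmetic for square matrices (in hardware, it is the iterations of the outer row loop that get distributed across the PEs):

```c
/* Computes Y = A * B + C for row-major n x n matrices. The addend C seeds
 * the accumulator, so no separate addition pass is needed. In the IP core,
 * each processing element computes (at least) one row i in parallel.
 */
void mat_mul_add(int n, const float *A, const float *B, const float *C,
                 float *Y)
{
    for (int i = 0; i < n; i++)            /* rows: one PE per row        */
        for (int j = 0; j < n; j++) {
            float acc = C[i * n + j];      /* inject addend into the MAC  */
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            Y[i * n + j] = acc;
        }
}
```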

Figure 7.

Matrix multiply-add operation.

3.3 Predict step

The predict step uses the transformed sigma points to calculate the a priori state estimate; the architecture for the predict step can be seen in Figure 8, showing how data flows between each module.

Figure 8.

Block diagram of the predict step.

The processor may initiate a predict step once it has placed valid transformed sigma points into the memory buffer. The prefetch module fetches the transformed sigma points from the memory buffer and places them into a parallel memory structure. The mean of the transformed sigma points, which is also the a priori state estimate (Eq. (10)), is then calculated. The transformed sigma points and the mean are used to calculate the ‘sigma point residuals’ via a subtract operation. From the ‘sigma point residuals’, the covariance of the set of transformed sigma points, which is also the a priori covariance (Eq. (11)), is calculated. Calculation of the mean and covariance is implemented by the calculate mean/covariance module described in Section 3.3.1; this section also describes the details of the ‘sigma point residuals’. Once the calculations are complete, the IP core writes the a priori state and covariance to the memory buffer so that both the processor and IP core have the current state estimate. Once the predict step is completed, a control bit is set to notify the processor and, if included, an interrupt is generated.

3.3.1 Calculation of mean/covariance

The calculations of the mean and covariance of the transformed sigma points are very similar, meaning both can be performed by the same datapath. Consider the calculation of the mean of the predict-step transformed sigma points:

\[ \hat{x}_k^- = \sum_{i=1}^{N} W_i \chi_i^x \tag{25} \]

This is a simple column-wise multiply-accumulation. Consider the calculation of the covariance:

\[ P_k^- = \sum_{i=1}^{N} W_i \left(\chi_i^x - \hat{x}\right)\left(\chi_i^x - \hat{x}\right)^T \tag{26} \]

The subtraction looks like it will cause inefficiencies in the datapath, similar to the division operation in the original Cholesky decomposition. However, let χ̃_i = χ_i^x − x̂ be the ith column of χ̃; the covariance calculation then reduces to

\[ P_k^- = \sum_{i=1}^{N} W_i \tilde{\chi}_i \tilde{\chi}_i^T \tag{27} \]

This ‘sigma point residual’ matrix χ̃ is of size M_state × N, where M_state is the number of state variables and N = M + 2 is the number of sigma points. The element-wise calculation is then

\[ P_{ij} = \sum_{k=1}^{N} W_k \tilde{\chi}_{ik} \tilde{\chi}_{jk} \tag{28} \]

This expression involves two multiplications followed by an accumulation; if the ‘sigma point residuals’ are calculated first with a subtract operation, then both the mean and covariance calculations simply involve a series of multiplications and accumulations. Similar to the matrix multiply-add operation, the basic calculations are encapsulated into one processing element, and then additional PEs are added to the datapath in order to calculate additional rows in parallel; the datapath can be seen in Figure 9.

Figure 9.

Calculate mean/covariance operation. W refers to the sigma point weights W_0, W_1.

The input to the datapath is either the transformed sigma points to calculate the mean or the residuals to calculate the covariance. The FIFO is used to skip the first multiplication when calculating the mean; the multiplexer selects which value is calculated.
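A software analogue of this shared datapath is sketched below, with the per-point weights gathered into an array W; this simplifies the W_0/W_1 selection that the FIFO and multiplexer perform in hardware, and all names are illustrative:

```c
/* Mean (Eq. (25)): column-wise weighted accumulation of the sigma points.
 * chi is Mstate x N (row-major), W holds the N per-point weights.
 */
void sigma_mean(int Mstate, int N, const float *chi, const float *W,
                float *mean)
{
    for (int r = 0; r < Mstate; r++) {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += W[k] * chi[r * N + k];
        mean[r] = acc;
    }
}

/* Covariance (Eq. (28)): P_ij = sum_k W_k * res_ik * res_jk, where
 * res_ik = chi_ik - mean_i are the 'sigma point residuals'.
 */
void sigma_cov(int Mstate, int N, const float *res, const float *W, float *P)
{
    for (int i = 0; i < Mstate; i++)       /* rows shared across the PEs  */
        for (int j = 0; j < Mstate; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += W[k] * res[i * N + k] * res[j * N + k];
            P[i * Mstate + j] = acc;
        }
}
```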

3.4 Update step

The update step corrects the a priori state estimate with a set of observations to generate the new state estimate. Many of the calculations in the update step are very similar to the predict step; the architecture for the update step can be seen in Figure 10, showing the data flow between modules.

Figure 10.

Block diagram of the update step.

As with the predict step, the processor must first place the valid transformed sigma points into the memory buffer before signalling the IP core to begin. First, the prefetch module converts the transformed sigma points into a parallel memory structure. The mean and ‘sigma point residuals’ are calculated and then used to calculate the observation covariance. The update ‘sigma point residuals’ are also combined with the predict ‘sigma point residuals’, which were calculated during the predict step, to calculate the cross-covariance between the two system models. The observation residual, z̃ − ẑ (Eq. (17)), is calculated with the current set of observations in the memory buffer. The observation covariance and cross-covariance are used to calculate the Kalman gain, before the matrix multiply-add modules use the Kalman gain and the a priori state estimate and covariance to calculate the new state estimate and covariance. The new estimates overwrite the a priori estimates in the internal memory and are also written into the memory buffer such that both the core and the processor have the most recent estimate. The core notifies the processor upon completion, setting a control bit and/or generating an interrupt.
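Once the Kalman gain is available, the closing arithmetic of the update step (Eqs. (17) and (19)) is brief. The sketch below uses the example application's dimensions (7 states, 6 observations) purely for concreteness:

```c
#define NX 7   /* number of state variables (example sizes only)          */
#define NZ 6   /* number of observation variables                         */

/* Update-step epilogue: correct the a priori state with the innovation
 * nu = z~ - z^ (Eq. (17)), then deflate the covariance (Eq. (19)).
 * K and Pxz are NX x NZ, P is NX x NX.
 */
void ukf_correct(const float K[NX][NZ], const float Pxz[NX][NZ],
                 const float nu[NZ], float xhat[NX], float P[NX][NX])
{
    for (int i = 0; i < NX; i++) {                 /* Eq. (17)            */
        float acc = 0.0f;
        for (int k = 0; k < NZ; k++)
            acc += K[i][k] * nu[k];
        xhat[i] += acc;
    }

    for (int i = 0; i < NX; i++)                   /* Eq. (19)            */
        for (int j = 0; j < NX; j++) {
            float acc = 0.0f;
            for (int k = 0; k < NZ; k++)
                acc += Pxz[i][k] * K[j][k];        /* (Pxz * K^T)_ij      */
            P[i][j] -= acc;
        }
}
```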

4. Testing and validation of the hardware/software codesign

To validate the implementation of the hardware/software UKF and demonstrate its effectiveness, an example application is presented. This demonstration emulates the attitude determination subsystem of a single, uncontrolled nanosatellite. It is envisioned that a system designer looking to use the HW/SW UKF in a new application simply formulates the UKF appropriately for that application, i.e. formulates the system models (1), (2) and sets the algorithm parameters. Then, once the UKF algorithm has been defined, the HW/SW codesign detailed in Section 3 can then be used to actually implement the UKF and accelerate its performance. The example application attempts to employ this process.

The UKF was implemented using a number of methods for validation and comparison purposes. Once formulated, the UKF was first implemented in MATLAB (SW) to validate the design of the UKF algorithm. Next, the UKF was implemented again using the HW/SW codesign on an FPGA development board in order to validate the codesign. Finally, the UKF was implemented a third time in C (SW), but on the same FPGA development board, to provide a performance benchmark the HW/SW codesign could be compared to.

The FPGA development board used was the Zedboard, featuring a Xilinx Zynq-7000 series XC7Z020, seen in Figure 11. The relevant features of the board are:

  • Dual ARM Cortex-A9 processor system (PS) @ 667 MHz

  • The equivalent of an Artix-7 device in programmable logic (PL)

  • AXI4 PS-PL interface

Figure 11.

Zedboard development board used for two of the UKF implementations.

The HW/SW codesign was implemented for the 1 PE and 2 PE cases. The hardware part of the codesign, the IP core, was developed in Verilog and synthesised and implemented using Vivado 2014.1; basic arithmetic was implemented using floating-point IP cores from Xilinx’s IP catalogue. All designs used a single-precision (IEEE 754-2008) number representation. The target synthesisable frequency for the IP core was 100 MHz, and the 2 PE case instantiated two processing elements for the whole design (i.e. each individual module had two processing elements). The software part of the codesign was implemented in C as a bare-metal application on the processor system. The general-purpose AXI4 interface between the PS and the PL, also clocked at 100 MHz, was used by the two parts to communicate with each other. The C (SW) implementation of the UKF was a bare-metal application that used the GNU Scientific Library (GSL) for its vector and matrix manipulations. All software was compiled using the -O2 optimisation flag.

To test the different UKF implementations, a simulator was constructed in MATLAB to model the nanosatellite’s motion; the details are given in Section 4.5. Simulated sensor measurements were generated from the nanosatellites’ motion and passed to each of the three UKF implementations, which act as the attitude determination subsystem. For the MATLAB implementation, the simulated measurements could be passed directly. For the HW/SW codesign and the C (SW) implementations, the simulated measurements were first exported to a C header which was included during compilation.

4.1 System model

The nanosatellite is modelled as a 1 U CubeSat. The attitude of the nanosatellite is represented by the unit quaternion \(\mathbf{q} = \begin{bmatrix} q^T & q_0 \end{bmatrix}^T\), where \(q = \begin{bmatrix} q_1 & q_2 & q_3 \end{bmatrix}^T\), which satisfies \(q_1^2 + q_2^2 + q_3^2 + q_0^2 = 1\).

The kinematic equations for the satellite in terms of quaternions are given by

\[ \dot{q} = \frac{1}{2}\left(q_0 I_{3\times3} + [q\times]\right)\omega \tag{29} \]
\[ \dot{q}_0 = -\frac{1}{2} q^T \omega \tag{30} \]

where ω is the angular rate and [q×] is the skew-symmetric matrix of q given by

\[ [q\times] = \begin{bmatrix} 0 & -q_3 & q_2 \\ q_3 & 0 & -q_1 \\ -q_2 & q_1 & 0 \end{bmatrix} \tag{31} \]

4.2 Sensor model

We consider a basic sensor set common on nanosatellites: a three-axis MEMS IMU comprising an accelerometer, a gyroscope and a magnetometer. The gyroscope is modelled with the standard gyroscopic model:

\[ z_g = \omega_T + \beta + \eta_g \tag{32} \]
\[ \dot{\beta} = \eta_d \tag{33} \]

where ω_T is the true angular velocity, β is the gyroscopic bias, β̇ is the gyroscopic bias drift and η_g, η_d are noise terms that are assumed to be zero-mean Gaussians. Similarly, we model the accelerometer and magnetometer as

\[ z_a = a_T + \eta_a \tag{34} \]
\[ z_m = m_T + \eta_m \tag{35} \]

where a_T is the true local acceleration vector, m_T is the true local magnetic vector and η_a, η_m are, again, zero-mean Gaussian measurement noise terms.

4.3 Predict model

We use a dead-reckoning model and the gyroscopic data to predict the motion of the nanosatellite. However, it is necessary to account for the gyroscopic bias drift, so we estimate the current gyroscopic bias as well. Let the state vector be

\[ x = \begin{bmatrix} q^T & q_0 & \beta^T \end{bmatrix}^T \tag{36} \]

The predict model, f, is then

\[ f\left(\chi_{k-1|k-1}^x, \chi_{k-1|k-1}^w\right) = \chi_{k-1|k-1}^x + f'\left(\chi_{k-1|k-1}^x, \chi_{k-1|k-1}^w\right) dt \tag{37} \]
\[ f'\left(\chi_{k-1|k-1}^x, \chi_{k-1|k-1}^w\right) = \begin{bmatrix} \frac{1}{2}\left(q_0 I_{3\times3} + [q\times]\right) z_g \\ -\frac{1}{2} q^T z_g \\ 0_{3\times1} \end{bmatrix} + w_k \tag{38} \]

where dt is the time step between samples, w_k = [η_q^T β̇^T]^T is the process noise and η_q is assumed to be a zero-mean Gaussian.
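Per sigma point, this predict model is only a handful of multiply-adds. The C sketch below assumes an illustrative state layout (vector part of q first, then q0, then the bias) and the simple Euler integration of Eq. (37); it is not the chapter's actual software listing:

```c
/* Propagate one sigma point through the predict model, Eqs. (37)-(38).
 * x[0..2] = q (vector part), x[3] = q0, x[4..6] = beta; zg is the gyro
 * measurement; w[0..6] is the process-noise part of the sigma point.
 */
void predict_model(float x[7], const float zg[3], const float w[7], float dt)
{
    const float q1 = x[0], q2 = x[1], q3 = x[2], q0 = x[3];

    /* qdot = 0.5 * (q0*I + [q x]) * zg   (Eq. (29)) */
    float qdot[3] = {
        0.5f * ( q0 * zg[0] - q3 * zg[1] + q2 * zg[2]),
        0.5f * ( q3 * zg[0] + q0 * zg[1] - q1 * zg[2]),
        0.5f * (-q2 * zg[0] + q1 * zg[1] + q0 * zg[2]),
    };
    /* q0dot = -0.5 * q^T zg              (Eq. (30)) */
    float q0dot = -0.5f * (q1 * zg[0] + q2 * zg[1] + q3 * zg[2]);

    x[0] += (qdot[0] + w[0]) * dt;        /* Euler integration over dt    */
    x[1] += (qdot[1] + w[1]) * dt;
    x[2] += (qdot[2] + w[2]) * dt;
    x[3] += (q0dot   + w[3]) * dt;
    x[4] += w[4] * dt;                    /* bias: random-walk drift only */
    x[5] += w[5] * dt;
    x[6] += w[6] * dt;
}
```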

4.4 Update model

The accelerometer and magnetometer data are used to correct for the gyroscopic bias, so the observation model, h, is

\[ h\left(\chi_{k-1|k-1}^x, \chi_{k-1|k-1}^v\right) = \begin{bmatrix} A_q(q)\, g\, b_a \\ A_q(q)\, b_m \end{bmatrix} + v_k \tag{39} \]

where b_a and b_m are the respective body frame vectors, g is the magnitude of the gravity vector (assumed 8.94 m s⁻² at an altitude of 300 km), v_k = [η_a^T η_m^T]^T is the measurement noise and A_q(q) is the rotation matrix between the body frame and the local frame given by

\[ A_q(q) = \begin{bmatrix} q_0^2+q_1^2-q_2^2-q_3^2 & 2\left(q_1 q_2 - q_0 q_3\right) & 2\left(q_0 q_2 + q_1 q_3\right) \\ 2\left(q_1 q_2 + q_0 q_3\right) & q_0^2-q_1^2+q_2^2-q_3^2 & 2\left(q_2 q_3 - q_0 q_1\right) \\ 2\left(q_1 q_3 - q_0 q_2\right) & 2\left(q_0 q_1 + q_2 q_3\right) & q_0^2-q_1^2-q_2^2+q_3^2 \end{bmatrix} \tag{40} \]

4.5 Simulation model

Collecting all the relevant terms, the initial augmented state vector is given by

\[ x_0^a = \begin{bmatrix} q^T & q_0 & \beta^T & 0_{4\times1}^T & 0_{3\times1}^T & 0_{3\times1}^T & 0_{3\times1}^T \end{bmatrix}^T \tag{41} \]

and the initial augmented covariance is a diagonal matrix with diagonal terms:

\[ \operatorname{diag}\left(P_0^a\right) = \begin{bmatrix} 1_{6\times1} & \eta_q & \dot{\beta} & \eta_a & \eta_m \end{bmatrix} \tag{42} \]

The state vector length is 7, the number of observation variables is 6 and the augmented state vector length is 20. The quaternion noise term was modelled with covariance η_q = 10⁻⁶. The simulated sensor set was homogeneous, so the modelled errors are the same for each nanosatellite. The gyroscopic bias drift was modelled with covariance η_d = 10⁻² (°/s)². The measurement noise terms were modelled with covariances η_g = 10⁻¹ °/s, η_a = 10⁻² g and η_m = 10⁻² gauss. The satellite was modelled as undergoing slow tumbling. The motion was modelled using the Euler angles in a local ground frame, which is relevant in most remote sensing applications; here, we use roll-pitch-yaw to refer to rotations about the x-y-z axes, respectively.

To generate the sensor measurements, the simulated motions were converted into the body frame via a rotation matrix, with 1-2-3 referring to roll-pitch-yaw, respectively:

\[ A_{euler} = \begin{bmatrix} c_1 c_2 & c_1 s_2 s_3 - s_1 c_3 & s_1 s_3 + c_1 s_2 c_3 \\ s_1 c_2 & s_1 s_2 s_3 + c_1 c_3 & s_1 s_2 c_3 - c_1 s_3 \\ -s_2 & c_2 s_3 & c_2 c_3 \end{bmatrix} \tag{43} \]

where c_i and s_i denote the cosine and sine of the ith rotation angle.

It is assumed that the magnetometer is aligned with the x-axis (b_m = [1, 0, 0]^T) and the accelerometer is aligned with the z-axis (b_a = [0, 0, 1]^T). Next, using the sensor models described earlier, noise terms were added to the sensor ‘truth’ data, which was then sampled at 1 Hz to simulate measurements from an actual set of sensors.

4.6 Results

The UKF was simulated in the MATLAB environment as well as on the Zedboard development board. For the Zedboard implementations, the simulated sensor data set was loaded into the onboard memory (RAM) and the UKF run as if it were receiving data from actual sensors. The data set used in all three implementations was the same. State estimates from the UKF were stored on the Zedboard for the duration of the simulation and then read back into MATLAB afterwards for analysis.

All three implementations produced (within working precision) the simulation results in Figure 12a and b; these figures show the absolute attitude error (i.e. the difference between the UKF-estimated attitude and the simulated ‘truth’) of the nanosatellite. In Figure 12a, the top graph shows the first tenth of a second of the simulation, highlighting the early convergence of the filter to the truth from an initial noisy estimate. The bottom graph shows the first second of the simulation, highlighting the ability of the filter to maintain its accuracy (< 0.1° error) after convergence. Figure 12b shows that the UKF is able to correct for the inaccuracies arising from the gyroscopic bias and bias drift over the full duration of the simulation.

Figure 12.

Absolute attitude error.

(a) At the very beginning of the simulation.

(b) For the full simulation.

These results demonstrate that there are no implementation issues when taking the UKF to a HW/SW codesign; the codesign, and IP core, is able to completely replicate software-based implementations of the UKF. The overall latency of the C (SW) implementation and the HW/SW codesign (1 PE and 2 PE cases) for the single nanosatellite case was measured using the ARMv7 Performance Monitoring Unit (PMU) and can be seen in Table 1. This overall latency is the time taken to complete one full iteration of the UKF (all steps). The 1 PE case offers a modest 1.8× increase in performance over the C (SW) implementation and can be run at 2 kHz, which is more than adequate for the sampling frequency assumed by the simulation. The 2 PE case offers a slightly better 2.4× speed-up over the C (SW) implementation. Note that the processor system operates at a clock frequency more than six times that used by the IP core (667 vs. 100 MHz), yet the IP core is still able to outperform the C (SW) implementation.

          SW     1 PE    2 PE
Total     660    363     272

Table 1.

Overall latency for the single nanosatellite. All values in μs.

5. Implementation analysis of the hardware/software codesign

Synthesis and implementation runs targeted the Zynq-7000 XC7Z045 at a target frequency of 100 MHz. Though the implementation of the example application presented in Section 4 was for the Zynq-7000 XC7Z020, the codesign does not fit on that device for larger numbers of processing elements. In order to still compare implementation details, this larger device in the Zynq-7000 family is used instead. All the devices in the Zynq-7000 family feature the same processing system; the only difference for larger devices is the amount of programmable logic available.

Resource utilisation of the device by the IP core is reported by Vivado post-implementation. The power analysis is done via the Xilinx Power Estimator (XPE) post-implementation; all power estimates exclude the device static power dissipation and the processing system power draw.

The execution time (latency) for the hardware part is measured via behavioural simulation in Vivado Simulator, assuming a clock frequency of 100 MHz; this assumption was validated post-implementation for all designs. Though behavioural simulations are usually used for only functional verification, Vivado Simulator provides cycle-accurate execution times as long as timing assumptions made in the simulation are verified post-implementation. The entire IP core utilises synchronous logic and is on a single clock domain which makes confirming the proper distribution of the assumed clock signals, in this case 100 MHz, relatively straightforward.

The execution time (latency) of the software part is measured via the ARMv7 Performance Monitoring Unit (PMU), which counts processor clock cycles between two epochs; because the number of processor clock cycles to perform a given task can vary, each measurement was conducted at least 10 times, and the average measured latency is reported here.
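For reference, one way to take such cycle counts on an ARMv7 processor from bare metal is to read the PMU cycle counter through the CP15 coprocessor interface, as sketched below (privileged execution is assumed, and the 667 MHz conversion matches the Zedboard's PS clock; this is an illustrative sketch rather than the measurement harness used in the chapter):

```c
#include <stdint.h>

/* Enable the PMU cycle counter: set PMCR.E, then PMCNTENSET.C. */
static inline void pmu_start(void)
{
    uint32_t pmcr;
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= 1u;                                    /* E: enable counters */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));
}

/* Read the 32-bit cycle counter (PMCCNTR). */
static inline uint32_t pmu_cycles(void)
{
    uint32_t ccnt;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
    return ccnt;
}

/* Usage: cycles between two epochs, converted to microseconds @ 667 MHz.
 *
 *   pmu_start();
 *   uint32_t t0 = pmu_cycles();
 *   ukf_iteration();                      // code under test
 *   uint32_t dt = pmu_cycles() - t0;
 *   float us = dt / 667.0f;
 */
```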

5.1 Synthesis results

Synthesis results for a selection of different numbers of processing elements can be seen in Table 2. These results do not include the processor but do include the logic necessary for the AXI4 interface ports. The numbers of PEs were chosen to be divisors of the number of augmented state variables so that the major datapaths remained data efficient. Recall, for example, the matrix multiply-add datapath (Section 3.2.2); each PE calculates an entire row in the result matrix. If the size of the matrix is not a multiple of the number of PEs, then the last iteration of the calculations will not have enough data to fill all the PEs, making the datapath slightly inefficient.

Resource   1 PE        2 PE         5 PE         10 PE
FF         7668 (2)    14,286 (3)   27,311 (6)   48,714 (11)
LUT        5764 (3)    15,158 (7)   29,500 (13)  53,427 (24)
BRAM       16.5 (3)    36.5 (7)     62 (11)      109.5 (20)
DSP48      35 (4)      62 (7)       104 (12)     182 (20)

Table 2.

Resource utilisation (% total) on the XC7Z045.

For low numbers of processing elements, the codesign utilises a relatively small percentage of the available resources. The XC7Z045 is a mid-range device in the Zynq-7000 series which means even the 10 PE case still only uses a quarter of the available LUTs. The codesign does not require a proportionally large amount of any one resource; in fact, the design uses a disproportionately smaller amount of FFs than other resources. This will allow easier integration into a full SoC, particularly if partially reconfigurable regions are used. Requiring too much of any one resource type can lead to placement and routing issues since resource on-chip locations are fixed by the manufacturer. This also implies that additional register stages could be added to major datapaths, which would increase the overall latency but could allow an increase in clock frequency as well. If the increase in clock frequency was greater than the increase in latency, the overall performance of the design would benefit.

5.2 Power consumption

A power consumption breakdown for the hardware IP core (i.e. excluding the processor) can be seen in Table 3. The power consumption for low numbers of PEs is reasonably low, due to the area-efficiency design goals and the heavy use of the FPGA’s clock-enable resources to disable modules that are not currently in use. For reference, the device static power consumption (@ 25 °C) is 245 mW, and the rough power consumption of the processing system is 1.5 W. A conservative estimate of the electrical power available to a CubeSat is in the order of 1-2 W per unit [15]; larger (2-3 U or more) CubeSats have a greater surface area to cover in solar panels. The 1 PE case could be incorporated into a 1 U or larger CubeSat with relative ease, but even for just the 2 PE case, a 2 U or larger CubeSat may be necessary.

Resource   1 PE   2 PE   5 PE   10 PE
Clocks     38     74     136    234
Signals    24     83     144    261
Logic      23     76     126    219
BRAM       51     82     112    209
DSP        4      6      21     52
Total      140    336    549    975

Table 3.

Power consumption of the codesign. All values in mW.

5.3 Timing analysis

A breakdown of the execution time (latency) of the different modules can be seen in Table 4. The design spends a large amount of time propagating the sigma points through the two system models. The majority of the time spent by the design is actually in these system models, making the software part the main bottleneck. Looking at the sigma point propagation process a little closer, however, the latency of reading the sigma points from the memory buffer and of writing the transformed points back to the memory buffer was 116 μs. The actual calculation of the system models took a mere 21 μs. So, the bottleneck is actually the speed of the AXI4 port in transferring data between the processor and the memory buffer. Using a higher-performance communication bus or other techniques, such as direct memory access (DMA), may alleviate this issue, but intra-chip communication methods are beyond the scope of this chapter.

Step           SW     1 PE    2 PE    5 PE    10 PE
Sig. gen.      -      -       92      61.5    51.5
System model   52     137     137     137     137
Predict        522    170     13      8.5     6.5
Update         87     56      30      21      17
Total          660    363     272     228     212

Table 4.

Latency of each stage for the codesign. System models encompass propagation through both the predict and the update models on the processor. All values in μs.

For the hardware part, the majority of time is spent in the sig_gen step. The two modules in the sig_gen step, the triangular linear equation solver and the matrix multiply-add, are both large matrix operations which scale with the number of augmented state variables. Operations in the predict and update steps tend to scale with the number of state or observation variables, respectively, which are always necessarily smaller than the number of augmented state variables. It should be noted that the hardware part appears to suffer from diminishing returns in latency reduction as the number of processing elements increases.

6. Conclusion

In this chapter, a scalable FPGA-based implementation of the unscented Kalman filter was presented. The proposed design balances development effort/complexity with performance, combining the advantages of both the traditional software approach and hardware approaches to create a design that system designers can easily use in a potentially wide variety of applications. Simulation and physical implementation results of the codesign were presented. The demonstration application simulated the attitude determination system of an uncontrolled nanosatellite, and the physical implementation was performed on the Xilinx XC7Z045.

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
