Solving Partial Differential Equation Using FPGA Technology

This chapter introduces a method that uses cellular neural network (CNN) technology on FPGA chips to solve differential equations over a large computing space. Because the resources of an FPGA chip are limited, the computing space has to be separated into several subspaces. Our solution has three parts: firstly, dividing the computing space into smaller areas and combining sequential and parallel computing; secondly, dividing and recombining the boundary areas, which must remain continuous so that temporary data are not lost during processing (buffer memory is used to store them); and thirdly, real-time data exchange. The control unit controls the activities of the whole system as set by the algorithm. We have successfully configured a CNN chip for solving the Navier-Stokes equations for hydraulic fluid flow on the Virtex 6 chip XCVL240T-1FFG1156 by Xilinx, with acceptable results.


Introduction
Solving partial differential equations (PDEs) has been investigated by many researchers, and numerical solutions have been implemented successfully on PCs. However, for problems with a large computing space, computation on a PC struggles to meet the requirements of speed and accuracy; in some cases, the problem cannot be solved at all because of the computational load. Researchers have successfully applied cellular neural network (CNN) technology to analyze such problems, design CNN chips, and solve some PDEs.
To solve a PDE with CNN technology, we have to analyze and discretize (difference) the original equations of the problem, find the templates, design the CNN architecture, and then configure an FPGA to make a CNN chip. This means that there is no single CNN chip for every equation; for each problem (consisting of several equations), an appropriate CNN chip has to be designed. Solving large problems requires computing resources to configure the blocks of the CNN chip. To save resources, we propose dividing the computing space into smaller subspaces and combining parallel and sequential calculation, which keeps the computing rate high while saving the resources of the FPGA chip used.
Because the architecture of a CNN chip varies with each problem, making the chip by traditional methods is very difficult and costly. With FPGA technology, users can use hardware description languages, such as Verilog and VHDL, to configure the logic elements in the FPGA to produce the electronic circuit of a CNN chip. Recent FPGA architectures (Virtex 7; Stratix 10) come with many supporting tools to test, optimize, and coordinate data exchange. FPGAs are therefore a suitable platform for making a CNN chip.

CNN and FPGA technology

Cellular neural network technology
The cellular neural network (CNN) was introduced by Chua and Yang at the University of California, Berkeley (USA), in 1988; it combines analog spatial-temporal dynamics with logic [1][2][3]. The CNN paradigm is a natural framework for describing the behavior of locally interconnected dynamic systems with an arrayed structure, so it is very useful in solving partial differential equations [3][4][5][6][7]. Today, visual microprocessors based on this processing type can reach TeraOPs computing power and approximately 50,000 fps. The possibility of developing algorithms and programs based on CNN was quickly exploited worldwide. There are now several CNN models for image processing, PDE solving, pattern recognition, gene analysis, etc. Depending on the problem, the designer can make a CNN chip with millions of cells. The common CNN architectures are 1D, 2D, and 3D.
The standard 2D CNN is a dynamic system of autonomous cells, each connected locally to its neighbors, forming a two-dimensional array [2,18]. Each cell C(i,j) in the array contains one independent voltage source, one independent current source, a linear capacitor, resistors, and linear voltage-controlled current sources, which are coupled to its neighbor cells via the controlling input voltage and the feedback from the output voltage of each neighbor cell C(k,l). The templates A(i,j;k,l) and B(i,j;k,l) are the parameters linking cell C(i,j) to its neighbor C(k,l). The effective range S_r(i,j) of radius r of cell C(i,j) is the set of neighbor cells which, in the standard Chua-Yang definition, satisfies (Figure 1):

$$S_r(i,j) = \{\, C(k,l) \mid \max(|k-i|,\,|l-j|) \le r,\ 1 \le k \le M,\ 1 \le l \le N \,\}$$
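The state equation of cell C(i,j) is given, in its standard form, by:

$$C\,\frac{dx_{ij}}{dt} = -\frac{1}{R}\,x_{ij} + \sum_{C(k,l)\in S_r(i,j)} A(i,j;k,l)\,y_{kl} + \sum_{C(k,l)\in S_r(i,j)} B(i,j;k,l)\,u_{kl} + z_{ij}$$

where R and C are the linear resistor and capacitor; A(i,j;k,l) is the feedback operator parameter; B(i,j;k,l) is the control parameter; and $z_{ij}$ is the bias value of the cell C(i,j). On the CNN chip, (A, B, z) are the local connective weight values of each cell C(i,j) to its neighbors. The output $y_{ij}$ of the cell C(i,j) is, in the standard piecewise-linear form:

$$y_{ij} = f(x_{ij}) = \frac{1}{2}\left(|x_{ij}+1| - |x_{ij}-1|\right)$$

The characteristic of the CNN output function $y_{ij} = f(x_{ij})$ is presented in Figure 2.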
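To make the cell dynamics concrete, the following is a minimal Python sketch of one forward-Euler update of a 2D CNN with r = 1. It assumes the standard Chua-Yang state equation C dx/dt = -x/R + ΣA·y + ΣB·u + z and the piecewise-linear output f(x) = 0.5(|x + 1| - |x - 1|); the function names, the space-invariant 3x3 templates, the time step dt, and the fixed boundary cells are illustrative assumptions, not the chip design of this chapter.

```python
def output(x):
    """Piecewise-linear CNN output: f(x) = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def euler_step(x, u, A, B, z, dt=0.01, R=1.0, C=1.0):
    """One forward-Euler step of the 2D CNN state equation (radius r = 1).

    x, u : state and input grids (lists of lists of floats)
    A, B : 3x3 feedback and control templates (assumed space-invariant)
    z    : bias value
    Boundary cells are held fixed (an assumption of this sketch).
    """
    rows, cols = len(x), len(x[0])
    y = [[output(v) for v in row] for row in x]
    new_x = [row[:] for row in x]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            s = z
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    s += A[di + 1][dj + 1] * y[i + di][j + dj]
                    s += B[di + 1][dj + 1] * u[i + di][j + dj]
            dxdt = (-x[i][j] / R + s) / C
            new_x[i][j] = x[i][j] + dt * dxdt
    return new_x
```

On the FPGA, all interior cells perform this update in parallel; the nested loops here only emulate that behavior sequentially.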
In the 3D CNN, besides the connections with its in-plane neighbors, each cell also has connections to the upper and lower layers in three-dimensional space [18], as shown in Figure 3.
Thus, if the radius r = 1, the cell C(i,j,k) has 26 neighbors; hence, the templates A and B carry a third index, with coefficients A(i,j,k) and B(i,j,k).
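The neighbor count follows from the size of the cube of side 2r + 1 around a cell, minus the cell itself. A short Python check, with hypothetical helper names chosen for illustration:

```python
from itertools import product


def neighbor_count(r, dims):
    """Number of neighbors within radius r in a dims-dimensional CNN grid."""
    return (2 * r + 1) ** dims - 1


def neighbors_3d(i, j, k, r=1):
    """Indices of all neighbors of cell C(i,j,k), excluding the cell itself."""
    return [
        (i + di, j + dj, k + dk)
        for di, dj, dk in product(range(-r, r + 1), repeat=3)
        if (di, dj, dk) != (0, 0, 0)
    ]
```

For r = 1 this gives 8 neighbors in 2D and 26 in 3D, matching the count above.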
The state equation of the 3D CNN extends the 2D form with the contributions of the neighbors in the upper and lower layers, and the output function is the same as for the 2D CNN. To solve a three-dimensional PDE, a 3D CNN must be used. The original PDE is discretized, and from the result the appropriate templates (A, B, z) of the 3D CNN are generated.

Field-programmable gate array (FPGA) technology provides blank chips whose available resources of logic gates and RAM blocks can be configured to implement complex digital computations. FPGAs can be used to implement any logical function. Their advantages include the ability to update the functionality after shipping, partial reconfiguration of a portion of the design, and low nonrecurring engineering costs relative to an ASIC design [13][14][15][16].
A recent trend has been to take the coarse-grained architectural approach by combining the logic blocks and interconnects of traditional FPGAs with embedded chips and related peripherals to form a complete "system on a programmable chip" [17][18][19].
Users such as teachers and students can use FPGAs to make prototypes for testing application systems; with VHDL or Verilog, they can easily design, test, and then reconfigure the system until it gives the desired results.

Using FPGA to make CNN chip for solving PDE
Because the CNN architecture is not the same for every application, the designer develops a particular chip for each problem, based on the standard model. An FPGA is well suited to this task: a blank chip is configured into a CNN chip using a hardware description language such as Verilog or VHDL. To solve a PDE, one first analyzes (differences) the original partial differential equations to find the appropriate templates, then designs the CNN chip architecture based on the templates found, and finally uses VHDL to configure the FPGA according to the designed hardware, producing the CNN chip.
Some PDEs that have been solved using the CNN technology are:
• Burgers equation [3]
• Klein-Gordon equation [19]
• Heat diffusion equation [3]
• Black-Scholes equation [9]
• Air pollution equation [4]
• Saint-Venant 2D equation [5]

Example of making a CNN chip for solving the Saint-Venant 1D equations:

• Designing the templates

First, the original equation (4) is rewritten as (6), and the space variable x is discretized with step Δx on the right-hand side of (6). After differencing only the right side of (6) in the space variable x by Taylor expansion, one obtains equation (7) for the cell at position (i). Note that, following the CNN algorithm, the symbol ∂h/∂t is kept on the left side. From (7), the templates are found, where R_h is the linear resistance of the cell circuit of h. Eq. (5) is changed slightly under the assumptions above: assume that q > 0, then k_q = 0. After differencing and applying the template design algorithm of CNN, one obtains the templates for (8). From the templates found, we can design the CNN architecture for the problem as (1) a two-layered 1D CNN chip (Figure 4) and (2) the h, Q processing block (Figure 5).
Both h and Q are combined in one block to form the physical architecture of a CNN cell.
In general, each calculation needs some basic computing blocks such as ADD, SUBTRACT, MULTIPLY, and DIVIDE. When designing a CNN cell on an FPGA, one would have to design many separate instances of these blocks to perform the arithmetic processing for each input. To save computing resources in the FPGA, the basic blocks can be shared within one cell, which leads to sequential calculation (Figure 6). In this case, the processing time of each cell is high. To reduce the processing time of each cell, we can use a pipeline mechanism as shown in Figure 7, but it needs more computing resources per cell. Finally, the cells in a CNN chip are processed in parallel, as in Figure 8. C1, ..., C4 are the coefficients shown in Figure 7. With a pipeline of length 6, the first calculation costs 6 clock pulses (clk), and each calculation after that needs only 1 clk.
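The trade-off between the shared (sequential) cell and the pipelined cell can be illustrated with a simple clock count. The sketch below is only a model: it assumes that a shared cell needs the full pipeline depth in clocks for every calculation, which is a simplifying assumption made here for comparison, and the function names are hypothetical.

```python
def shared_unit_clocks(n_calcs, depth):
    """Clock pulses when one shared arithmetic unit handles each
    calculation from start to finish before the next one begins
    (assumed cost: depth clocks per calculation)."""
    return n_calcs * depth


def pipelined_clocks(n_calcs, depth):
    """Clock pulses with a pipeline of the given depth: the first result
    appears after depth clocks, each later result after 1 more clock."""
    if n_calcs <= 0:
        return 0
    return depth + (n_calcs - 1)


# Example: 10 calculations with a pipeline of length 6.
shared = shared_unit_clocks(10, 6)
piped = pipelined_clocks(10, 6)
```

Under these assumptions, the pipeline pays its latency once and then delivers one result per clock, which is the behavior described for Figure 7.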

Physico-mathematical model of Navier-Stokes equations
In hydraulics, many flow models have been researched, such as flows in channels, streams, or rivers, in order to control the flow for preventing disasters, saving water, and exploiting the energy of the flow. Most mathematical models of these phenomena are partial differential equations, such as the Saint-Venant equations and the Navier-Stokes equations [8,9]. The Navier-Stokes equations come in several types, with various parameters and constraints. Using CNN technology, we can solve those that have well-defined boundary condition values; that is, we do not study boundary problems in depth. The strength of CNN technology is that it yields a physically parallel computing chip, which increases the computing speed enough to satisfy a real-time system.
The Navier-Stokes equations considered here consist of three partial differential equations, with functional variables representing the water height and the flow velocities in the x- and y-directions. The empirical model is a flow through a small port, which diffuses in the two directions Ox and Oy.
Solving the Navier-Stokes equations with a CNN requires discretizing the continuous model by the difference method; the smaller the difference intervals, the higher the accuracy. However, if the difference intervals are too small, the complexity and time of the calculation increase. With the physically parallel processing ability of the CNN chip, these difficulties can be overcome.
Assume that the height of the water is measured from the bottom of the flow, which is regarded as the origin of the coordinate system, so z_w has no negative values.
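The link between the difference interval and the accuracy can be seen on a small example. The following Python sketch assumes a second-order central difference, which is one common choice of Taylor-based formula and not necessarily the exact scheme used for the chip; the test function sin(x) is an arbitrary illustration.

```python
import math


def central_diff(f, x, dx):
    """Second-order central difference approximation of df/dx at x."""
    return (f(x + dx) - f(x - dx)) / (2.0 * dx)


def diff_error(dx, x=1.0):
    """Absolute error of the approximation for f = sin, whose exact
    derivative is cos."""
    return abs(central_diff(math.sin, x, dx) - math.cos(x))


# Smaller intervals give smaller errors, at the price of more grid points.
for dx in (0.1, 0.01, 0.001):
    print(dx, diff_error(dx))
```

Halving the interval roughly quarters the error for this formula, but also doubles the number of grid points per axis, which is the cost that parallel processing on the chip is meant to absorb.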
The model consists of the three equations (9)-(11), including the momentum equations in the x-direction and in the y-direction. The quantities in the equations have the following meanings:
• $\partial(\rho q_x)/\partial t$ and $\partial(\rho q_y)/\partial t$: quantities characterizing the momentum variation over time along the x-axis and y-axis, respectively.
• $\rho g d\,\partial z_w/\partial x$ and $\rho g d\,\partial z_w/\partial y$: potential energy variations of the flow in the x- and y-directions.
• $\rho g d\,S_{fx}$ and $\rho g d\,S_{fy}$: influence of friction from the bottom and walls of the channel on the flow in the x- and y-directions. The values of $S_{fx}$ and $S_{fy}$ are determined from the physical properties of the bottom and walls of the hydraulic channel.
• $\tau_{wx}$ and $\tau_{wy}$: wind pressure on the free surface of the hydraulic flow in the x- and y-directions. Here $c_{s1}$, $c_{s2}$, and $W_{min}$ are values obtained from practice; for example, with $W_{min}$ = 4 m/s and a wind speed of 10 m/s, $c_{s1}$ = 1.0 and $c_{s2}$ = 0.067.
• $\rho_a$ is the air density at the free surface (kg m$^{-3}$); W is the wind speed at the free surface; and Ψ is the angle between the wind direction and the x-axis.

Analyzing and designing CNN to solve the equations
To simplify, the parameters are renamed: the water level z_w = h, the velocity along the x-axis q_x = u, and the velocity along the y-axis q_y = v. Assume that q_A = 0, and that the kinetic influence of turbulence between the velocities in the directions Oy and Ox (or Ox and Oy) is trivial, since the horizontal velocity is small enough to be considered zero; then (9)-(11) are rewritten as (12)-(14).

Step 1: Differencing the equations following the Taylor formula. Using a finite difference grid with difference interval Δx along the x-axis and Δy along the y-axis, and applying the Taylor difference formulas to Eqs. (12)-(14), we obtain the corresponding difference equations (15)-(17).

Step 2: Designing the CNN templates. Based on the CNN state equations and the difference equations (15)-(17), we obtain the CNN templates (18)-(20) for the layers h, u, and v.

Step 3: Designing the hardware architecture of the CNN to solve the Navier-Stokes equations. Based on the templates found in (18)-(20), we can design the circuit architecture of the CNN chip. It is a three-layered 2D CNN. Then, the arithmetic unit for each layer and the links needed to perform parallel calculation on the chip can be made. Figure 9 shows the architecture of layer h and layer u (layer v is similar to u).

Proposed system architecture for MxN CNN
The practical problems that need to be solved are: firstly, identifying the boundary points of the whole difference grid (space); secondly, dividing the entire computing space into smaller subspaces, where the division and recombination of boundary areas must be performed appropriately to avoid incorrect results caused by the time step of the computation; and thirdly, controlling real-time data exchange and combining sequential and parallel computing in a CNN chip. The CNN chip proposed in this chapter handles these issues in a way similar to our previous work [4,5]. The new issues here are dividing the computing space, processing the dynamic sub-boundaries, and combining sequential and parallel computing.

General MxN CNN
Each CNN cell has its own data element and a core that performs the computing function. The CNN has MxN CNN cells in which only (M-2)x(N-2) CNN cells have computing functions, so that the CNN has MxN data elements and (M-2)x(N-2) cores ( Figure 10).
The Buffer supplies MxN data elements to the CNN. Each set of MxN data elements is called one block of data (Figure 11).
The white area holds the data elements for the CNN boundary cells, and the gray area holds the data that must be processed by the CNN. The CNN arithmetic unit has a size of (M-2)x(N-2) cells and processes the data of the gray area, which lies inside the input buffer unit.
The Input memory has PxQ blocks of data. It is a true dual port memory. The Temp memory also has PxQ blocks of data. It is a simple dual port memory, used to temporarily store the data computed by the CNN core and to supply data to the Boundary updating unit. The data sent from the PC for processing have the size mxn (Figure 12). Assume that m = 5, n = 6, M = 3, and N = 4; the white part is the boundary and the gray part is the area to be processed. Before the data are processed, temporary vertical and horizontal boundaries need to be added, as in Figure 13, column (0,3) and row (3,0).
The temporary vertical and horizontal boundaries are added so that the data structure matches that of the CNN buffer. After the temporary vertical and horizontal boundaries have been added, the data are sent to the Input memory. The blocks of data in the Input memory unit (for mxn = 5x6 and MxN = 3x4) are detailed in Figure 14, where 0, 1, 2,.., 6 are the addresses of blocks. For mxn = 5x6 and MxN = 3x4, we have P = 3 and Q = 2.
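The division into blocks can be sketched in Python as follows. The sketch assumes that each MxN block carries a one-cell boundary ring around its (M-2)x(N-2) computing area, so that neighboring blocks overlap by two rows or columns, and that blocks are addressed in row-major order; the function name and the addressing order are illustrative assumptions, since the exact layout is given only in the figures.

```python
def split_into_blocks(grid, M, N):
    """Split an m x n grid into P x Q overlapping M x N blocks.

    Each block contains an (M-2) x (N-2) computing area plus a one-cell
    boundary ring, so the duplicated rows/columns play the role of the
    temporary boundaries. Returns (P, Q, blocks) with blocks in
    row-major order (an assumption of this sketch).
    """
    m, n = len(grid), len(grid[0])
    P, rem_rows = divmod(m - 2, M - 2)
    Q, rem_cols = divmod(n - 2, N - 2)
    if rem_rows or rem_cols:
        raise ValueError("computing area is not a multiple of the block core")
    blocks = []
    for p in range(P):
        for q in range(Q):
            r0 = p * (M - 2)  # top row of the block, boundary included
            c0 = q * (N - 2)  # left column of the block, boundary included
            blocks.append([row[c0:c0 + N] for row in grid[r0:r0 + M]])
    return P, Q, blocks


# Example with m x n = 5 x 6 and M x N = 3 x 4, as in the text.
grid = [[10 * i + j for j in range(6)] for i in range(5)]
P, Q, blocks = split_into_blocks(grid, 3, 4)
```

With these sizes the sketch yields P = 3 and Q = 2, in agreement with the values stated above.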
The detailed structure of the Boundary updating unit (for MxN = 3x4) is shown in Figure 15.
The control unit controls the activities of the whole system as set by the following algorithm:

    At every posedge of clk do {
        if (has IO event)
            do the IO task;
        else {
            buffer = read(Input memory)
            if (finish computing the first block)
                if (BoundaryUpdating())
                    write(Input memory)
        }
    }

Proposed CNN architecture when M = 3 (3xN CNN)
The 3xN CNN architecture is similar to the general MxN CNN architecture (with M = 3). To reduce memory consumption and simplify the Boundary updating unit, there are some differences (Figure 16).
Each block of data in the memory (Input memory or Temp memory) has 1xN data elements. Assume that the data sent from the PC for processing have the size mxn with m = 5 and n = 6, and that N = 4. As mentioned above, the data are processed after the temporary vertical boundaries have been added, so the Input memory unit has 5x2 blocks of data (m = 5, Q = 2), as shown in Figure 17.
Each block has a size of 1x4 data elements. The Buffer unit is a shift-up register of size 3xN. Its input and output have sizes of 1xN and 3xN, respectively. The input is at the bottom.
The Input memory has m rows and Q columns of blocks of data. The control unit reads the blocks in the Input memory column by column (vertically) and puts each block of data at the input of the buffer. The buffer then shifts up by one step. After the third step, the Buffer holds a full 3xN block of data to supply to the CNN core. After each subsequent step, the Buffer again holds a 3xN block of data to supply to the CNN core (Figure 18).
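The behavior of the shift-up buffer can be modeled in a few lines of Python. This is a behavioral sketch only, with a hypothetical class name; it assumes that the oldest row is at the top and is discarded when a new 1xN block enters at the bottom.

```python
from collections import deque


class ShiftUpBuffer:
    """Behavioral model of a 3xN shift-up register fed with 1xN blocks."""

    def __init__(self, n_cols, depth=3):
        self.n_cols = n_cols
        self.rows = deque(maxlen=depth)  # index 0 is the top (oldest) row

    def push(self, block):
        """Insert a 1xN block at the bottom; existing rows shift up."""
        if len(block) != self.n_cols:
            raise ValueError("block must have N data elements")
        self.rows.append(list(block))

    def ready(self):
        """True once the buffer holds a full 3xN window."""
        return len(self.rows) == self.rows.maxlen

    def window(self):
        """Current 3xN window, top row first."""
        return [row[:] for row in self.rows]


def stream_column(column_blocks, n_cols):
    """Feed one column of 1xN blocks and collect every full 3xN window."""
    buf = ShiftUpBuffer(n_cols)
    windows = []
    for block in column_blocks:
        buf.push(block)
        if buf.ready():
            windows.append(buf.window())
    return windows
```

For a column of m blocks, the model produces m - 2 windows: the first after three pushes, then one per push, which mirrors the step-by-step description above.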

Implementation
In this part, we implement the 3xN CNN. Q, m, and N are parameters that can be configured before compiling and programming the FPGA chip. By default, we assign Q = 2, m = 8, and N = 4.

Development environment
For the experiment, the ISE Design Suite software version 14.7 and the ML605 evaluation board, which includes the chip XCVL240T-1FFG1156 (Virtex 6), are used to implement the schematic of the CNN.
First, we use the Verilog HDL to describe the CNN architecture. Then, we use the ISim simulator to verify our system. Finally, we program the system onto the FPGA chip on the ML605 board.
The experimental system is shown in Figure 20.

Input data for h, u, v values
The input of the CNN for solving the Navier-Stokes equations consists of the h, u, v values. We use three Input memory units, three Buffer units, and three Temporary memory units to store the h, u, v values. Each data element is represented as a 32-bit floating-point real number. The data for h, u, v, with the temporary boundaries added, are detailed in Figure 22 (presented in decimal and in the hexadecimal form of single-precision floating point).
The interfaces of the Input memory and Temporary memory for h, u, v are all configured in the same way, as in Figure 23. The initial data for the Input memories h, u, v are loaded from COE files. A COE file stores the initial values for a memory (Figure 24).
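As an illustration of how such initial data could be prepared, the Python sketch below converts decimal values to the hexadecimal form of IEEE 754 single-precision numbers and writes them in the usual COE layout (a radix line followed by a comma-separated vector ending in a semicolon). The function names and sample values are hypothetical, and the sketch does not reproduce the actual contents of the files used in the experiment.

```python
import struct


def float_to_hex(value):
    """Hex string of the IEEE 754 single-precision encoding (big-endian)."""
    return struct.pack(">f", value).hex()


def make_coe(values):
    """Build COE text with radix 16 and one 32-bit word per line."""
    words = ",\n".join(float_to_hex(v) for v in values)
    return (
        "memory_initialization_radix=16;\n"
        "memory_initialization_vector=\n"
        + words + ";\n"
    )


# Hypothetical sample values for one memory.
coe_text = make_coe([1.0, 0.5, -2.0, 0.0])
```

Generating the file from the decimal values in this way avoids converting each number to hexadecimal by hand.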

Control unit
The interface of Control unit is described as follows.

System scheme
To verify the system, the interface of the top module of the system should include all the signals that we want to verify.
The top module is described as follows.

Simulation results
The ISE design software shows the device utilization summary as in Table 1. Figures 25-27 show the schematics synthesized by the ISE design software.
Comparing the new values of h in Figure 28i, k (doutH) with Figure 29, we can see that the 3x4 CNN system worked well.
The simulation results show the correctness and effectiveness of the implementation method. The cost of calculating the first three blocks of 1xN taken from the memory units h, u, v is 10 clock pulses, of which 1 clock pulse is for the initial reading of the Input memory, 3 clock pulses are for the initial updating of the buffer to the CNN, and 6 clock pulses are for the initial calculation. Each successive 1xN unit takes only 1 clock pulse to calculate, because the pipeline mechanism is used both to update the buffer to the CNN and to calculate in the CNN arithmetic unit. After each column of blocks of data in the Input memory has been read, 2 more clocks are needed to initialize the buffer again. It also takes 1 clk for the initial writing of the Temp memory, 1 clk for the initial reading of the Temp memory, and 1 clk for the initial writing of the result back to the Input memory.
The time T for one computing cycle is obtained from these costs. For the above implementation, m = 8, Q = 2, and T = 32 (clk).

Conclusion
This chapter presents a solution for configuring a CNN chip to solve the Navier-Stokes equations, with particular attention to handling the temporary boundaries when they are required. The purpose is to divide a large data space into many subspaces, so that the processing of the large data space is based on the calculation of each subspace. With input data of 32-bit floating-point real numbers and the FPGA chip Virtex 6 XCVL240T-1FFG1156, a CNN of 1x12 cells has been implemented successfully. The installation results show that the effectiveness of this solution mainly lies on the