Efficient Simulation of Fluids

Fluid simulation is based on Navier-Stokes equations. Efficient simulation codes may rely on the smooth particle hydrodynamic toolbox (SPH), a method that uses kernel density estimation. Many variants of SPH have been proposed to optimize the simulation, like implicit incompressible SPH (IISPH) or predictive-corrective incompressible SPH (PC-ISPH). This chapter recalls the formulation of SPH and focuses on its effective parallel implementation using the Nvidia common unified device architecture (CUDA), while message passing interface (MPI) is another option. The key to effective implementation is a dedicated accelerating structure, and therefore some well-chosen parallel design patterns are detailed. Using a rough model of the ocean, this type of simulation can be used directly to simulate a tsunami resulting from an underwater earthquake.


Introduction
Submarine earthquakes may generate tremendous disasters for human, like what occurred during the Tohoku earthquake in 2011. Even if their seismic waves may damage buildings and structures when they occur close to the coast, the tsunami they generally cause are a massive risk for humans. Indeed, the energy produced by a massive undersea quake is transmitted into the water at high speed and results in a high wave when it arrives on the coast.
To avoid human losses, tsunami's simulations can help to inform the governments and society about the risks before and after a submarine earthquake. This chapter presents solutions for implementing such simulations. The main objective is to be able to calculate the propagation of the tsunami wave into the ocean and then to simulate efficiently its effects when the wave reaches the coast. These kinds of simulation can be done in two dimensions considering only the profile of the coast or in three dimensions when all the topography is considered. In both cases, the simulation must handle how water is affected by the earthquake wave.
Viscosity is the measure depicting how a fluid resists deformations. Even water is considered having a non-nil viscosity: so, this parameter must be considered carefully for tsunami simulations. Water simulation relies on Navier-Stokes equations that describe the motion of a viscous fluid. Unfortunately, Navier-Stokes equations cannot be solved directly like it is the case for many differential equations. The only way to obtain a solution at a given time consists of approximating it through simulation. In practice, two family of methods may be used. The first one consists in discretizing the simulation space into small parts and to do the

SPH formulation
SPH is a Lagrangian approach, meaning that particles representing tiny parts of the fluids may move during the simulation. It is based on density estimation applied to moving particles, leading to an approximation of the Navier-Stokes equations. This section recalls these equations and presents the basics of SPH.

Navier-Stokes equations
Navier-Stokes equations model the dynamics of a fluid. They rely on the Newton second law, stating that the sum of the forces applied on a body equals the product of its mass by its acceleration (∑F ¼ m Á a). In practice, it is a system of two equations: the mass continuity equation and the momentum equation. The first one is given by where ρ designs the fluid density, ∇ is the gradient operator, and u is the flow velocity. The momentum equation is where Á∇ is the divergence operator, ∇ 2 the Laplacian operator, p the pressure, μ the dynamic viscosity coefficient, and g the gravity term. The right part of the momentum equation represents the sum of the forces that the fluid undergoes, where À∇p is the pressure force, μ∇ 2 u the viscosity force, and ρg the gravitational force. With the fluid velocity function being unknown, it is not possible to compute analytically its divergence. Then, the momentum equation is nonlinear.
Nevertheless, some methods allow to calculate an approximation of these two equations. Most of them regularly discretize the Euclidean space and calculate an approximation by using the finite difference theorem. The advection term (the left part of the momentum equation) is approximated placing particles into the grid and then computing their displacement. In other words, each grid cell contains a given amount of fluid, and the algorithm calculates the exchanges between adjacent cells. Such a solution is quite difficult to use into environment where some large part (like ocean) and highly detailed parts must be considered together.
Another method to approximate the Navier-Stokes equations is SPH. The Euclidean space is no more discretized. Instead, it considers some moving particles representing the fluid and their interactions. Each particle comes with its specific velocity, pressure, density, and viscosity. Then, the total derivative allows to approach the advection term (those between parentheses on the left part of the momentum equation) by a single derivative term du dt . For a given particle i, the momentum equation becomes Therefore, the acceleration a i of the particle i is given by

Introduction to SPH
SPH relies on the kernel density estimation [4]. When we only have some samples of a given function, we can estimate its value at a new location using a kernel function W and the following estimation: The kernel function W must be a unit positive function. In other words, it must satisfy the two following properties: We denote W h a kernel function with bounded support 0; h ½ . This simply means that W h x ð Þ ¼ 0 for all x < 0 or x ≥ h. This mathematical tool is used to approximate any scalar field A i for any particle i: where m j and ρ j are, respectively, the mass and the density of the jth particle. Useful kernels for liquids are given in [3]. From this simple expression, we can deduce the estimation of the gradient and the Laplacian of a scalar A i : These formulas allow to calculate the density of any particle, the gradient of the pressure, and the Laplacian of the velocity to approximate a solution of the Navier-Stokes equations. For each particle i, the SPH algorithm follows: • Compute the density ρ i : • Compute the pressure p i : where k is the gas constant and ρ 0 is the rest density.
• Compute f i , the sum of the forces at particle i of pressure, viscosity, surface tension, and gravity: where σ is the tension coefficient relating to the interface between the fluid and the exterior (the air), n i is the normal vector to a particle i, and cs i is the color field of the particle i.
• Compute the velocity u i and the new particle position x i using a small integration time step Δt: The SPH simulation uses these formulas to compute the positions of the particles for a given time length through an iterative procedure. The particles' interactions are very important: we use a rather small support (small h value) for the kernel function in order to limit the number n of neighboring particles. Then, in any good implementation, one of the key elements is the neighboring handling. Using a parallel processor, this can be achieved with a low complexity, allowing to reach short computation times.

SPH algorithms
SPH method presented in Section 2.2 is quite immediate to implement [3]. Using a small kernel support, the calculation of the forces that apply to a given particle is quite fast, since only a few numbers of neighbors have to be considered. Nevertheless, the neighborhood needs to be efficiently computed and stored to accelerate the calculations. This needs to be done for each time step. To do that, a regular grid is the faster solution. The size of a grid cell is set as the radius of the kernel support. Then, to find the neighbors of a given particle, it is enough to consider the cells surrounding the one containing this particle. In dimension 2 this leads to 9 cells (including the cell containing the particle) and 27 in dimension 3.
The SPH method described in Section 2.2 has been extended to solve some accuracy problem with incompressible fluids, for instance, predictive-corrective incompressible SPH (PC-ISPH), incompressible SPH (ISPH), and implicit incompressible SPH (IISPH) [5][6][7]. In Ref. [7], comparisons between these three techniques show that IISPH is faster than PC-ISPH and ISPH, mainly since it allows to use bigger time steps. Hence, this chapter focusses on an implementation of IISPH. This evolved method is also more complex than classical SPH, and then each time step uses more calculations (but they are longer, so it is faster still). More precisely, for each particle it calculates the density ρ i and the forces of viscosity, surface tension, and gravity like in SPH method. It adds the calculation of the advection velocity, which is the portion of the velocity independent to the pressure exerted by the other particles: The IISPH algorithm calculates the advection factor d ii and the advection coefficient a ii : The IISPH algorithm continues with the calculation of pressure's forces. It is done through at least two corrective loops to enforce the minimization of the difference between the rest density and the sum of the density of all particles. First, this loop calculates the advection density: Second, it calculates the following term per particle that will be used many times in the next steps: where l is the iteration number of the corrective loop. Notice that for l ¼ 0, IISPH uses p 0 i ¼ ωp i , with ω ¼ 0:5. Then, the IISPH corrective loop continues by computing for each particle the pressure force thanks to the following expression: where p i is the pressure at a particle i: This last term is computed using the displacement factors: All these calculations should be made in parallel to reduce the computation times, using a tuned implementation, for instance, using message passing interface (MPI) for high-performance computing (HPC) or using the Nvidia common unified device architecture (CUDA) on graphics processing unit (GPU) for simpler computers.

Parallel SPH implementation
An efficient SPH implementation relies on parallelism at some level. A fully parallel solution may become a very efficient solution, as previous works have shown it. While most of the calculations may be done considering a single particle into a single core, finding the neighboring particles that play a role in the density, the pressure, and the external forces needs collaboration between different cores.
Using the texture mechanism available with GPU, working with the neighbors is quite simple and efficient. Nevertheless, this implies to store all the particles into a regular grid at each time step during the simulation. This part is somewhere the most complicated, and the key step for an efficient implementation.
This section first presents the main parallel patterns (MAP, SORT, SCAN, etc.) and then shows how they can be combined to write a new fast parallel SPH solver.

Parallel patterns
Writing a parallel algorithm is not as simple as writing a sequential algorithm. This truism is based on the necessary consideration of the collaborations between the different processors of a parallel machine: all the processors must work in concert, and not isolated as in a sequential approach. These collaborative aspects are the main difficulty. How to make sure all these processors expect when it's needed and work to the fullest when no synchronization is required?
Rather than writing a parallel algorithm based on classical sequential patterns, parallel patterns make it possible to write a parallel algorithm directly, abstracting the underlying machine. These patterns rely on very simple parallel architecture, called the parallel random-access memory (PRAM). It assumes a synchronization between an infinite set of processors and an infinite amount of memory [8].

Simple parallel patterns
Simple parallel patterns do not need synchronization. This means that, using a GPU or an HPC, they may be run without any difficulties, even with less processors than needed. The simpler one is the MAP, or transform, that consists in applying a given function f to an input data to obtain the output. The key of this pattern is about the localisation of the data: input and output are generally considered as vectors (or arrays). Then, MAP applied to data at the same index: Figure 1 describes this pattern on small arrays.
In many occasions, it is necessary to write the result at a new location, another index. When each possible destination index is used once and only once, we obtain a quite simple parallel pattern called SCATTER. It consists of writing the input data from location i to the destination location map i ð Þ, map being a permutation function. Figure 2 illustrates this parallel pattern.
In the same spirit, the GATHER parallel pattern writes at index i data coming from index map i ð Þ, using again a permutation function. To differentiate between a SCATTER and a GATHER, you should remember that at first we read contiguous data, while in the second, we write at contiguous location. This is resumed with the following two expressions: PRAM model is very useful to write efficient algorithms on theory. Nevertheless, at the end these algorithms run on real computers, with a limited amount of memory and a fixed number of processors. Brent's theorem links the theoretical computation time on PRAM model with the one obtained using only p processors: an algorithm made in O 1 ð Þ using m processors will run in O m p using only p processors. This allows to predict the behavior of (very) simple algorithm on a GPU.

Advanced parallel patterns
In many cases, some degree of collaboration is needed between processors. This leads to some more complicated parallel patterns. A very common parallel pattern using such a collaboration is the SORT that sorts data according to a given order. It is used in previous SPH implementation for building the neighbors' grid. The SORT pattern is based on the PARTITION pattern that moves values with respect to a given predicate. More precisely, for n values X i and using the predicate P i ∈ 0; 1 ½ , the PARTITION pattern moves the values X i for which P i ¼ 1 at the beginning of the resulting array, the others at the end (see Figure 3 for a simple example).  The SCATTER and GATHER patterns move data using a permutation.
These complex patterns are built using a fundamental pattern called SCAN. It corresponds to a prefix sum of values, according to the following expression: The fundamental pattern exists in two versions: inclusive and exclusive ones. The first corresponds to the expression given above, doing a sum-up to the current output position. The exclusive version omits the current position, doing a sum-up to i À 1 and using a nil value for Y 0 (generally, using 0): Figure 4 shows that these two versions of SCAN are almost the same, except the shift between the resulting arrays: the values obtained with inclusive version correspond to the ones obtained with the exclusive version at the same position plus one.
Another pattern of interest into this chapter is the REDUCE that allows to calculate a single value from an array of values and using any given associative binary function: For instance, using X ¼ 1; 2; 3; 4; 5; 6; 7; 8; 9; 10 ½ and the classical integer sum as binary operator, this pattern returns Y ¼ 55, the sum of the 10 first non-nil integers.
These complex patterns have roughly speaking all the same complexity, in O log n ð Þon a PRAM machine and O n p log n p using p processors only. Nevertheless, since they are built using the SCAN, PARTITION and SORT are in practice more complex and take more time. A fast implementation of the SORT pattern relies on the radix sort algorithm that loops over the number of digits of the maximum key to sort, thus having a practical complexity in O 32 log n ð Þwith 32-bit integers.  The last programming tool this section covers is the atomic operation notion. A load-modify-write operation cannot be handled in parallel program without caution. Let us consider two processors doing a "plus one" in parallel at the same time. The addition is done by the CPU using registers (local memory to the CPU). Hence the variables to add need to be loaded from the main memory, then added, and then stored into the main memory. If the two processors do the load-modify-write operation at the same time exactly on the same variable, then the result is false. If the processors are not exactly synchronized, the result is certainly false also: to be correct, the two operations must be done sequentially. Atomic operations provide this behavior, performing the read-modify-write operation for one and only one processor at a time.
Obviously, other parallel patterns exist. They are not discussed in this chapter since they are not used in our SPH implementation.

Grid building
Previous SPH implementations use the SORT pattern to build the neighbors' grid [7,9,10]. The first step consists in calculating the grid index of each particle, using a MAP. Next, the particles are sorted with respect to this index. Then, it is necessary to compute the number of particles per cell and the starting position of each cell. In [9], atomic operations are used for these two operations: the minimum for the first particle into each cell and the addition for the number of particles per cell.
In Ref. [7], authors follow a similar approach with the particle sort with respect to their cell index but using a MAP to mark the start and the end of each cell with respect to the sorted cell indices, considering their unicity.
The main problem is that the sorting algorithm takes a large part of the computation time, near 30% according to [10]. In this chapter, we avoid the full sorting by combining simple parallel patterns and atomic operations. Our grid building algorithm is summarized in Figure 5.
This algorithm uses the Nvidia Thrust API with some freedom to shorten it. First, at line 8 the number of particles per grid cell is set to zero. Next, like with previous methods at line 9, the index of each particle is calculated with a MAP. Using a second MAP at line 10, the particle cell offset is calculated using an atomic addition. More precisely, we use the CUDA int. atomicAdd(int*cc, int. a) function that adds a to the variable *cc and returns the old content of *cc. Since atomic operations are done in sequence, the number of particles per cell is correctly computed. Moreover, each particle receives the old counter value, which is 0 for the first atomic operation execution, 1 for the second, and so on up to m À 1 for the last particle added to the cell, m being the number of particles added into the cell. These values are used to scatter all the particles to their local position into the grid, at line 13. But, before to do that we need to calculate the global grid offset, corresponding to the position of the first particle of each cell. This is done using an exclusive SCAN at line 11 to compute the global offset, followed by a MAP at line 12 to calculate each particle global offset.
In practice, this algorithm can be optimized in many ways. First, the device vectors can be allocated only once, and not each time the grid is built. Second, the first two transforms (lines 9 and 10) can be mixed into one. This will limit the memory loading into device registers, known as a major performance limitation with GPU. At last, the last transform (line 12) and the scatter (line 13) can be mixed into a single call again to minimize the memory bandwidth usage. Moreover, the particles' data must be split into multiple arrays for efficiency (one array for position, one for density, one for pressure, etc.) as in [10].

Main algorithm
The most difficult part of the implementation of the IISPH method is the construction of the neighbors' grid, as for any non-mesh density kernel method. The rest of the calculation is rather simple and relies on two parallel models: the MAP for all the loops on particles and the REDUCE to control the termination of the corrective loop in the calculation of the pressure force.
It is noticeable that the IISPH loop to correct the pressure force runs on the CPU, because there is no available global synchronization on the GPU. Then, the REDUCE is used to return a value from the GPU to the CPU, to decide if more corrections are needed or not. Nevertheless, since this just consists of sending one real value, it is not a big bottleneck.
Moreover, many calculations use data from the neighbors (pressure, density, position, etc.). L1 GPU's memory is used to accelerate these calculations, reducing the computation time around a third in our experiment. Notice also that the IISPH corrective loop amortizes the neighbors' grid building. In our experiments the grid building now represents less than 10 percent of the full computation time.

Experiments
The IISPH is a valid solution to simulate a tsunami [11]. Its main advantage regarding a discrete method is that it does not need to refine the mesh near the obstacles, like the coast and the buildings. Moreover, the wave can go everywhere, including interfering with the beach, buildings, infrastructure, etc.
In this chapter, we illustrate the tsunami simulation using IISPH algorithm through a rather simple scenario. It contains a short coast ending with a mountain. We put a building just after the beach. The main difficulty, if either, consists of generating the solitary wave. A tsunami, for instance, is generated by an earthquake at long distance. The produced wave runs at 200 meter per second (720 km/h). We do not need to simulate the propagation of the wave since its epicenter, which is quite difficult with long distance: it needs very long simulation time to see the wave reaching the beach, and obviously it needs a huge amount of memory to handle the sea between the two distant locations. Instead, we simulate the wave into a rather small space. We can predict the time of arrival to the beach, assuming we know the exact distance between the beach and the earthquake location.
In [11], authors solve the solitary wave solution of Boussinesq. They calculate the wave paddle displacement using the equation: where C is the wave velocity, t is the time in the simulation frame, κ is the decay coefficient, and θ is given using Newton's method by More precisely, θ is the solution of the following problem: where H and h are the wave height and the water depth, respectively. While this method works on CPU, it is not well-suited for a CUDA implementation of the IISSPH, mainly because the number of iterations of the Newton-Raphson method depends on the input values, and so is not constant per particle.
Hence, in this chapter we use a different but simple technique. The wave is produced using a piston wave generator. Here, the piston is a huge virtual object that moves the water to reach the speed of the wave. The length and the speed of the piston movement are calibrated to obtain the good height and speed of the tsunami solitary wave. Figure 6 illustrates such a simple wave simulation, before the tsunami wave arrives. Figure 7 shows the wave arriving at the building at t ¼ 4 s in the simulation frame. In Figure 8, at t ¼ 8 s, the building is completely below the ocean that returns  into its bed after some more time (see Figures 9 and 10). The building is made with fixed particles that nevertheless are considered with the moving particles of the fluid. This allows an interaction between two kinds of particles, and it permits to obtain the pressure and force applied to the building, for instance, to check if it will resist or not.

Conclusions
This chapter focusses on the simulation of a tsunami solitary wave. Such a wave is mainly produced by submarine earthquake and may provoke vast disasters for   human living near the coasts. Such phenomena also may produce strong degradation on buildings and structures, in turn inducing human loss as what happened after the Tohoku earthquake in 2011. To avoid these disasters, it is important to be able to validate the robustness of structures and buildings near the dangerous coasts and to inform population after an always unpredictable submarine earthquake.
To achieve these goals, it is necessary to produce robust and fast fluid simulator software. To simulate a tsunami wave, a good candidate is the SPH method. Since it does not need the usage of a fix mesh like in discrete techniques, it allows to handle correctly the wave running on the beach and after. Moreover, it correctly handles the contact with buildings and structures, allowing to simulate the forces that they undergo.
This chapter recalls the implicit incompressible SPH method, which is one of the fastest among the SPH ones. The parallel implementation for GPU is detailed in depth, with a fast algorithm to build the neighbors' grid, avoiding the classical sorting method which is more time-consuming.
At last, this chapter proposes a simple tsunami wave simulation using a piston wave generator, a simple solution for implementing and providing valuable results. It can be used to simulate tsunami generated by submarine earthquake occurring in a pattern of seismic source mechanism when both the location and intensity are estimated.