Free-Surface Flow Simulations with Smoothed Particle Hydrodynamics Method using High-Performance Computing

Today, the use of modern high-performance computing (HPC) systems, such as clusters equipped with graphics processing units (GPUs), allows solving problems with resolutions unthinkable only a decade ago. The demand for high computational power is certainly an issue when simulating free-surface flows. However, taking the advantage of GPU’s parallel computing techniques, simulations involving up to 10 particles can be achieved. In this framework, this chapter shows some numerical results of typical coastal engineering problems obtained by means of the GPU-based computing servers maintained at the Environmental Physics Laboratory (EPhysLab) from Vigo University in Ourense (Spain) and the Tier-1 Galileo cluster of the Italian computing centre CINECA. The DualSPHysics free package based on smoothed particle hydrodynamics (SPH) technique was used for the purpose. SPH is a meshless particle method based on Lagrangian formulation by which the fluid domain is discretized as a collection of computing fluid particles. Speedup and efficiency of calculations are studied in terms of the initial interparticle distance and by coupling DualSPHysics with a NLSW wave propagation model. Water free-surface elevation, orbital velocities and wave forces are compared with results from experimental campaigns and theoretical solutions.


Introduction
The non-stopping growing of computing power allowed increasing more and more spatial and temporal discretization when simulating engineering problems. The use of modern highperformance computing (HPC) systems, such as clusters equipped with graphics processing units (GPUs) or central processing units (CPUs) structured into a multi-node framework, let academics and professionals solve free-surface flow problems with resolutions unthinkable just a decade ago. Different spatial and temporal scales are often involved when simulating such kinds of phenomena, which may comprise wave generation, propagation, transformation and interaction with coastal or inland defences.
Since a decade or so, SPH has been coded in the massive high-performance computing (HPC) context, making use of the Message Passing Interface (MPI) [56,57] and the OpenMP library [58,59], the standards for distributed and shared memory programming, respectively. Several applications involving multicore processors [60,61] and graphics processing units (GPUs) [62][63][64][65][66][67] have been proposed so far. Joselli and co-workers [68] showed in 2015 that performing neighbour search on GPUs yields up to 100 times speedup against CPU implementations, therefore proving the benefits on exploiting the high floating-point arithmetic performance of GPUs for general purpose calculations. The same conclusion was drawn earlier in Ref. [69]. The first versions of SPH running on GPUs were presented in Ref. [70] and then in Ref. [69]. Non-Newtonian fluid flow simulations have been carried as well. Bilotta and co-workers, for instance, applied their GPUSPH model to lava flows [71]. In 2013, Wu and co-workers run GPUSPH to model dam-break flood through complex city layouts [72,73]. Rustico et al. [74] measured the overall efficiency of the GPUSPH parallelization by applying the Karp-Flatt metric [75]. In Ref. [76], massive simulations of free-surface flow phenomena were carried on single and multi-GPU clusters. They used the sorting radix algorithm for inter-GPU particle swapping and subdomain 'halo' building to allow SPH particles of different subdomains interacting. In 2015, Cercos-Pita proposed the software AQUAgpusph [77] based on the use of the freely available Open Computing Language (OpenCL) framework instead of using the Compute Unified Device Architecture (CUDA) platform. In Ref. [78], Gonnet proposed scalable algorithms based on hierarchical cell decompositions and sorted interactions executed on hybrid shared/distributed memory parallel architectures. In Ref. [79], a general rigid body dynamics and an absolute nodal coordinate formulation (ANCF) were implemented to model rigid and flexible objects interacting with a moving fluid. In 2012, Cherfils and co-workers released JOSEPHINE [80], a parallel weakly compressible SPH code written in Fortran 90, intended for free-surface flows. Incompressible SPH (ISPH) algorithms, running on GPUs, have been developed as well [81][82][83].
This chapter shows some numerical SPH results of typical coastal engineering problems obtained by means of two different supercomputers: the GPU-based machine maintained at the EPhysLab from Vigo University in Ourense (Spain), mounting 14 NVIDIA Kepler-based cards, with a total of 39168 CUDA cores and the Tier-1 Galileo cluster, introduced on January 2015 by the Italian computing centre CINECA, a non-profit consortium, made up of 70 Italian universities, 6 Italian research institutions and the Italian Ministry of Education, University and Research (MIUR). Galileo is equipped with 516 nodes, each mounting 2 8-cores Intel Haswell 2.40 GHz for a total of 8256 cores, up-to-date Intel Phi 7120p (2 per node on 384 nodes) and NVIDIA Tesla K80 accelerators (2 per node on 40 nodes). Comparison with theoretical and experimental results is also included.

SPH fundamentals
Recent comprehensive reviews and related applications of the SPH method are given in [84][85][86][87][88][89]. Governing equations describing the motion of fluids are usually given as a set of partial differential equations (PDEs). These are discretized by replacing the derivative operators with equivalent integral operators (the so-called integral representation or kernel approximation) that are in turn approximated on the particle location (particle approximation). Next, Section 2.1 gives further details about these two steps, with reference to a generic field f( x ! ) depending on the location point x ! ∈ ℜ n d , whereas Section 2.2 provides more specific details concerning the treatment of Navier-Stokes equations.

Approximation of a field f(x) and its spatial gradients
Following the concept of integral representation, any generic continuous function f x ! can be obtained using the Dirac delta functional δ, centred at the point x ! (Figure 1) as where Ω y represents the domain of definition of f and x ! , y ! ∈ Ω. Replacing δ with a smoothing function W x ! À y ! ; h , Eq. (1) can be approximated as W is the so-called smoothing kernel function or simply kernel and h, acting as spatial scale, is the smoothing length defining the influence area where W is not zero. While Eq. The kernel function W has to satisfy some properties (see, for instance, [90,91]). The following condition ð is known as partition of unity (or the zero-order consistency) as the integration of the smoothing function must yield the unity. Since W has to mimic the delta function, Eq. (3) can be rewritten as a limit condition in which the smoothing length tends to zero: Still, W has to be defined even, positive and radial symmetric on the compact support: where ϕ is a positive quantity defining the extent of the compact support. A large number of kernel functions are proposed in literature. Among the others, a computational-efficient and high accurate kernel is proposed by Wendland [92], defined as.
where A(n d ), depending on the number of dimensions n d , denotes a scaling factor that ensures the consistency of Eq. (3), whereas q denotes the dimensionless distance x ! Ày ! =h: The integral representation given by Eq.
(2) can be converted into a discretized summation over all particle N within the compact support ( Figure 2), yielding the particle approximation: where the index k refers to particles within the compact support (see bold ones in Figure 2), with mass m k and density r k being carried. Note that in this case the particle approximation is marked by the 'a' pedix. The subscript will be avoided from now on. Eq. (7) can be rewritten with reference to particle 'i' as Particle approximation of spatial derivatives of a field function, such as divergence and gradient, is expressed using the gradient of the kernel function rather than the derivatives of the function itself: Figure 2. A kernel function defined at the particle 'i' and its support of radius ϕh. Local neighbourhood corresponds to the bold particles.
where the nabla operator ∇ ! is referred to the location of particle 'i'. The symbol '•' denotes the dot product. Eqs. (9) and (10) offer the great advantage of estimating their left-hand side in terms of the kernel gradient, i.e. allowing no special hypotheses on the particular field function. A different formulation of the gradient field can be derived by introducing the following identity [87] ∇ inside the integral in Eq. (2), yielding in this case Likewise, the divergence, another particle approximation of the gradient, can be derived, taking into account the following equation: yielding Eqs. (12)- (14) are conveniently employed in fluid dynamics as they preserve the conservation of linear and angular momentum.

SPH form of governing equations
The mostly used governing laws ruling fluid motion are the Navier-Stokes equations, which specify that mass and linear momentum are preserved. Conservation laws in Lagrangian form are as follows: in which r and v ! are, respectively, the density and velocity field, p is the isotropic pressure, ν is the laminar kinematic viscosity, is the Laplacian operator and f ! the external force. Different approaches [93][94][95] are available to derive the density particle approximation of the continuity, Eq. (15a), and momentum, Eq. (15b). For instance, referring to Eq. (12), the density rate at particle 'i' can be approximated as follows: The material derivative of the velocity field can be deduced from Eq. (14) for the case of inviscid fluids, that is, ν = 0: Numerical diffusion in terms of an artificial viscosity, e.g. proposed in Ref. [96], can be added in Eq. (17), allowing shock waves to be properly simulated: The dissipative term Π ik introduced above is the most general viscosity used in SPH computations, since it provides good results when modelling shock fronts. It is here defined as The The term c refers to the speed of sound which magnitude has conveniently to be at least 10 times greater than the maximum estimate of the scalar velocity field [94], η = 0.1 h is employed to prevent numerical divergences when two particles are approaching and α is a coefficient that needs to be tuned in order to introduce the proper dissipation. A value of α = 0.01 is suggested in Ref. [5] for wave propagation and wave-structure interaction studies.
Problem closure is achieved by combining conservation equations in discrete form (16) and (18) with an equation of state, when the weakly compressible scheme is adopted. A relationship between pressure and density is given in Ref. [97]: where c 0 is the reference speed of sound, large enough to guarantee Mach numbers lower than 0.1-0.01, γ ¼ 7, r 0 = 1000 kg/m 3 , when the liquid is water. c 0 is numerically computed like at least 10 times the expected maximum velocity.

The DualSPHysics code
DualSPHysics [98][99][100] is an open-source code developed by the University of Vigo (Spain) and the University of Manchester (UK) in collaboration with experts from all around the globe that can be freely downloaded from www.dual.sphysics.org. The code, written in two languages, namely, C++ and CUDA, is capable of using the parallel processing power of either CPUs or GPUs making the study of real engineering problems possible. Graphics processing units (GPUs) are massive floating-point stream processors adopted in computer game industry and image processing. Recently, they have been used in scientific computing thanks to the widespread of tools such as CUDA and OpenCL. Using CUDA as the programming framework for SPH leads to possible confusion with the word 'kernel'. An SPH kernel is the weighting function used in the SPH interpolation process in particle approximation of ruling equations, e.g. Eqs. (16) and (18). A CUDA kernel, however, is defined as a CUDA function that is set up and executed N times in parallel by N different CUDA threads. Herein, to avoid confusion, we use the term function to describe the CUDA kernels. DualSPHysics makes full use of the function hierarchy present within the CUDA framework. A function executed and called by the CPU is declared as a host function, whereas a global function is called by the CPU but executed in parallel by the GPU. A device function, on the other hand, is only called and executed within the GPU by a global or another device function. This hierarchy is used for the computation of the interparticle forces. The simulations in DualSPHysics consist of three main steps: (i) creation of the particle neighbour list (NL), (ii) force computation (FC) for the particle interaction and (iii) the system update (SU) at the end of the time step.
Due to the Lagrangian nature of SPH, the particle interaction results to be the most timeconsuming part of the whole algorithm. Each particle, as already stated, only interacts with its neighbour particles. Therefore, the construction of the neighbour list must be optimised. The cell-linked list described in Ref. [54] is implemented in DualSPHysics. This approach is preferred to the traditional approach, named Verlet list [101] that implies higher memory requirements than the cell-linked list. Besides, [54] proposed an innovative searching procedure based on a dynamic updating of the Verlet list and analyzed the efficiency of all the algorithms in terms of computational time and memory requirements.
DualSPHysics has proven its performance, reaching limits like being able to simulate more than 10 9 particles using 128 GPUs with an efficiency close to 100% [67].

Boundary conditions
Extensive research has been conducted over the last few years to develop accurate and efficient boundary conditions (BCs) in SPH method. Several approaches are proposed in the literature, such as boundary repulsive forces, fluid extensions to the solid boundary and boundary integral representing the term preservation. In DualSPHysics, boundaries (walls, bottom, coastal structures, wave generators, vessels, floating devices, etc.) are described using a discrete set of boundary particles that exert a repulsive force on the fluid particles when they approach. The so-called dynamic boundary condition [102] is used in DualSPHysics, where the boundary particles satisfy the same equations as the fluid particles; however, they do not move according to the forces exerted on them. Instead, they remain fixed (fixed boundary) or move according to some externally imposed movement (gates, flaps, etc.). Using this boundary condition, when a fluid particle approaches a boundary particle and the distance between them decreases beyond the kernel range, the density of the boundary particles increases giving rise to an increase of the pressure. This results in a repulsive force being exerted on the fluid particle due to the pressure term in the momentum equation. This dynamic boundary condition implemented in DualSPHysics does not include a specific value to define wall friction. However, this has been achieved in different validations by specifying a different viscosity value in the momentum equation when the fluid particles interact with the boundary ones.

Long-crested wave generation
The waves are generated in DualSPHysics by means of moving boundaries that aim to mimic the movement of a piston-type and flap-type wavemakers as in physical facilities. Only longcrested wave can be generated at this stage. The implementation of first-order and secondorder wave generation theories is fully described in Ref. [3]. For monochromatic waves, this means to include super-harmonics. For random waves, subharmonic components are considered to suppress spurious long waves. Two standard wave spectra are implemented and used to generate random waves: JONSWAP and Pierson-Moskowitz spectra. The generation system allows having different random time series with the same significant wave height (H m0 ) and the same peak period (T p ), just defining different phase seeds.

Wave reflection compensation
Wave reflection compensation is used in physical facilities to absorb the reflected waves at the wavemaker in order to avoid that they will be reflected back into the domain. In this way, the introduction into the system of extra spurious energy that will bias the results is prevented. The socalled active wave absorption system (AWAS) is implemented in DualSPHysics. The water surface elevation η at the wavemaker position is used and transformed by an appropriate time-domain filter to obtain a control signal that corrects the wave paddle displacement in order to absorb the reflected waves every time step. Hence, the target wavemaker position is corrected to avoid reflection at the wavemaker. The position in real time of the wavemaker is obtained through the velocity correction of its motion. Further details on AWAS in DualSPHysics are reported in Ref. [3].

Hybridization with SWASH model
The hybridization technique between DualSPHysics model and SWASH model (http://swash. sourceforge.net/) is fully described in Ref. [5]. This technique aims to use each model for a specific purpose that best matches with its own capabilities, reducing the total computational cost and increasing the model accuracy. The advantages of using a hybridization technique between SWASH and DualSPHysics can be summarised as follows: • Fast computations with large domains can be performed with SWASH, avoiding simulating large domains with DualSPHysics that requires huge computation times even using hardware acceleration.
• SWASH is suitable for calculation where statistical analysis is necessary such as computing wave height and good accuracy is obtained for wave propagation.
• SWASH is not suitable for calculation of wave impacts, while DualSPHysics can easily compute wave impacts, pressure load and exerted force onto coastal structures.
• Complex geometries cannot be represented with SWASH, and computation stability problems may appear when applied to rapidly changing bathymetry. Using DualSPHysics, any complex geometry or varying bathymetry can be simulated.
The hybridization between DualSPHysics and SWASH has been obtained through a one-way hybridization at this stage. The basic idea is to run SWASH for the biggest part of the physical domain to impose some boundary conditions on a fictitious wall placed between both media. This fictitious wall acts as a nonconventional wave generator in DualSPHysics: each boundary particle that forms the wall (hereafter called moving boundary or MB) will experience a different movement to mimic the effect of the incoming waves. SWASH provides values of velocity in different levels of depth. These values are used to move the MB particles. The displacement of each particle can be calculated using a lineal interpolation of velocity in the vertical position of the particle. Therefore, the MB is a set of boundary particles whose displacement is imposed by the wave propagated by SWASH and only exists for DualSPHysics. A multilayer approach can be used in SWASH. The SWASH velocity measured at each layer is therefore interpolated and converted into displacement time series for DualSPHysics.

Hardware features 4.1. The EPhysLab cluster
The GPU cluster maintained at the Environmental Physics Laboratory (EPhysLab) of Vigo University comprises four computing servers, whose details are as follows: • • Two NVIDIA accelerators K80 per node on 40 nodes (80 in total, 20 available for scientific research) • Eight nodes devoted to login/visualisation

Test cases
Aiming to prove the capability of DualSPHysics model to reproduce accurately waves and wave-structure interaction phenomena, three different test cases have been selected and are here reported, namely: 1. Wave generation and propagation of random wave train in 2D.

2.
Wave run-up on a cubic block breakwater in 3D.

3.
Coupling of DualSPHysics with SWASH model and application to wave forces on coastal structures in shallow water conditions (2D).

Test case N. 1
The first test case comprises the generation and absorption of random waves in DualSPHysics. The 150 s time series to be generated is calculated starting from a JONSWAP spectrum. The target wave conditions are H m0 = 0.06 m and T p = 1.3 s. The water depth is 0.36 m. The wavelength is 2.09 m. Second-order wave generation has been used (i.e. bound long waves). The wave conditions correspond to a second-order Stokes wave. The geometrical layout of the case is depicted in Figure 3: an 8.4-m long wave tank is modelled. A damping zone (passive absorption) is defined at the end of the tank. The water surface elevation and orbital velocities are measured using a 5-wave gauge array where the central wave gauge is at 2 L from the generator. The numerical results are compared with theoretical solutions.
A sensitivity analysis on the initial interparticle distance, dp, has been carried out. Four different values of dp have been selected in a range of H/dp between 6 (coarsest resolution) and 20 (finest resolution). For each case, the number of fluid particles and the computational runtime are reported in Table 1.  The case with H/dp = 10 has been simulated also on Galileo supercomputer, specifically using one node in order to compare the computing capabilities between Galileo and one GPU from the EPhysLab cluster. The comparison is expressed in terms of the number of calculation steps per second of computational time. For each node in Galileo, 23.9 step/s can be simulated, whereas with a Tesla K20, 156.3 step/s are achieved. These results refer to 2D simulations, and they are expected to be different for 3D modelling.
The numerical results for H/dp = 10 have been plotted against the theoretical ones for each sensor position. They are all depicted in Figures 4-6. The model accuracy has been estimated in terms of spectral values of wave height and period. The numerical error, together with the calculated values for H m0 and T m-1,0 , is reported in Table 2 for WG3 (x = 4.18 m). For H/dp = 6, the wave height is underestimated about 4%; meanwhile, starting from H/dp = 10, the errors for both wave height and period are in the order of 1-2%. Similar results are attained for the other four wave gauges. The orbital velocities show same degree of accuracy.

Test case N. 2
The second test case consists of a 3D case where the wave run-up on an armour breakwater has been simulated. Cubic blocks are displaced forming two layers with regular pattern on the seaward face of the breakwater, which has an angle of 28.3 with the horizontal. The side of each  Figure 7. Using post-processing tools of DualSPHysics, an isosurface of the fluid has been extracted and plotted in ParaView software (www.paraview.org): this is coloured in blue in Figure 7. The two layers of cubic blocks are coloured in grey. The four yellow dots on the free surface indicate where the water surface elevation has been measured. The coloured area (from yellow to white) indicates all locations where the run-up has been measured.  The water surface elevation measured in the numerical tank is depicted in Figure 9 for each wave gauge location. The wave run-up has been calculated for 26 cross sections along the width of the flume: the averaged time series is shown in Figure 10.   The third test case comprises the validation of the hybridization technique between DualSPHysics model and SWASH model to study the impact of overtopping flows on multifunctional sea dikes with shallow foreshore. The main aim is to prove that overtopping flow characteristics and wave forces are modelled correctly and that the hybridization can represent a reliable solution that can be used as complementary or alternative to physical     (Belgium) to measure forces on the vertical wall (i.e. building), the layer thickness and velocities of the overtopping flows [103]. The geometrical layout is depicted in Figure 11: the foreshore slope was 1:35 and dike height 0.1 m. The dike slope was 1:3. Here, we refer to only regular wave cases.
SWASH has been previously validated against the physical model results: wave propagation, transformation and breaking have been accurately modelled, and the conditions at the toe of the dike are reproduced as in the physical model test. Then, SWASH has been implemented together with DualSPHysics to model the wave impact. Eight layers have been used in SWASH simulation. A hybridization point along the physical domain has been defined, and it is located at x = 30.24 m from the physical wavemaker in its neutral position (Figure 11 An initial interparticle distance, dp, of 0.003 m has been used leading to 494,388 fluid particles in DualSPHysics model. Fifty seconds of physical time has been simulated in the TITAN X graphic card, taking 10.8 h. A case with the whole physical domain modelled in DualSPHysics has been also modelled: in such case, the moving boundary is represented by the physical Figure 11. Layout of the flume at FHR and indication of the coupling point location for the SWASH-DualSPHysics model. wave generator, and its location is then at x = 0.00 m. This stand-alone DualSPHysics model took 95.6 h using the same TITAN X to simulate 3,389,266 fluid particles, about 10 times slower than the hybridised model. The numerical and experimental free-surface elevation and layer thickness are plotted in Figure 12, showing that the numerical solution resembles the experimental accurately. The forces on the wall are represented in Figure 13. The differences between numerical and experimental results might be explained because of the highly turbulent and stochastic nature of the overtopping wave impact in this case, which makes the experimental test not repeatable (see [3] for further discussion on model inaccuracy for wave impacts).

Conclusions
The chapter offers a panoramic on the application of the SPH-based DualSPHysics code on supercomputers maintained at the EPhysLab from Vigo University in Ourense (Spain) and the Italian computing centre CINECA. Three test cases were selected in the general context of the coastal engineering: (1) wave generation and propagation of random wave train in 2D, (2) wave run-up on a cubic block breakwater in 3D and (3) coupling of DualSPHysics with SWASH model and application to wave forces on coastal structures in shallow water conditions (2D). Scalability is discussed by varying the spatial resolution, and efficiency is proved in the case of the hybridization. Comparison with theoretical free-surface elevation and orbital velocities for test case N. 1 and measured overtopping layer thickness and forces on vertical walls for test case N. 3 was satisfactory.