## 1. Introduction

A considerable number of modern high-fidelity Computational Fluid Dynamics (CFD) solvers and codes still adopt either one-dimensional physical models based on the Riemann problem using higher order shape functions, such as higher order Finite Volume (FV) and Discontinuous Galerkin Finite Element (FE) methods for the discrete data representation, or truly multi-dimensional physical models using linear shape functions, like Fluctuation Splitting (FS) schemes. Both of these approaches require fast convergence acceleration techniques in order to compete with conventional solvers based on artificial dissipation or upwind schemes in terms of CPU time. Implicit methods based on the Newton’s rootfinding algorithm are receiving an increasing attention in this context for the solution of complex real-world CFD applications, for example in the analyses of turbulent flows past three-dimensional wings, due to their potential to converge in a very small number of iterations [1, 2]. In this chapter, we consider convergence acceleration strategies for the implicit solution of the Reynolds-averaged Navier-Stokes (RANS) equations based on the FS space discretization using a preconditioned Newton-Krylov algorithm for the integration. The use of a Newton solver requires the inversion of a large nonsymmetric system of equations at each step of the non-linear solution process. Choice of linear solver and preconditioner is crucial for efficiency especially when the mean flow and the turbulence transport equation are solved in fully coupled form. In this study, we use the restarted Generalized Minimal Residuals (GMRES) [3] algorithm for the inner linear solver, preconditioned by a block multilevel incomplete lower-upper (LU) factorization. We present the development lines of the multilevel preconditioning strategy that is efficient to reduce the number of iterations of Krylov subspace methods at moderate memory cost, and shows good parallel performance on three-dimensional turbulent flow simulations.

The chapter is structured as follows. The governing conservation equations for both compressible and incompressible flows are reviewed in Section 2. Section 3 briefly describes the fluctuation splitting space discretization, the time discretization and the Newton-Krylov method used to solve the space- and time-discretized set of governing partial differential equations (PDEs). In Section 4, we present the development of the multilevel preconditioning strategies for the inner linear solver. We illustrate the numerical and parallel performance of the preconditioner for the analysis of turbulent incompressible flows past a three-dimensional wing in Section 5. Some concluding remarks arising from the study are presented in Section 6.

## 2. Governing equations

In the case of inviscid and laminar flows, given a control volume

where

and for incompressible, constant density flows,

Finally, the source term

## 3. Solution techniques

The model used in this study for the discrete data representation is based on the coupling of an hybrid class of methods for the space discretization, called Fluctuation Splitting (or residual distribution) schemes [5], and a fully coupled Newton algorithm. By “fully coupled” we mean that the mass, momentum and energy conservation equations on one hand, and the turbulent equation on the other, are solved simultaneously rather than in a decoupled or staggered fashion. We discuss in the following subsections, separately, the space and time discretization, the numerical integration of the set of equations resulting from the discretization, and the solution of the large linear system at each Newton’s step.

### 3.1. Space discretisation

The Fluctuation Splitting approach has features common to both Finite Element (FE) and Finite Volume (FV) methods. Like in standard FE methods, the dependent variables are stored at the vertices of the computational mesh made up of triangles in the two-dimensional (2D) space, and tetrahedra in three-dimensional (2D), and are assumed to vary linearly and continuously in space. Denoting *Z _{i}* as the nodal value of the dependent variable at the grid point

Note that, although the summation in Eq. (4) extends over all grid nodes, the computational molecule of each node is actually limited only to the set of its nearest neighbors due to the compact support of the linear shape functions. In the compressible case, Roe’s parameter vector

is chosen as the dependent variable to ensure discrete conservation [6]. In the incompressible case, discrete conservation is obtained by simply setting the dependent variable

The integral Eq. (1) is discretized over each control volume **Figure 1(a)**. With FS schemes, rather than calculating the inviscid fluxes by numerical quadrature along the boundary

is evaluated by means of a conservative linearization based on the parameter vector [6], and scattered to the element vertices using elemental distribution matrices **Figure 1(b)**. Concerning the viscous terms, the corresponding flux balance is evaluated by surface integration along the boundaries of the median dual cell: node *e* which accounts for the viscous flux through the portion of

Summing up the inviscid and viscous contributions to the nodal residual of gridpoint *i* one obtains

In Eq. (7), the summation ranges over all the elements **Figure 1(b)**. The construction of the distribution matrices

### 3.2. Time discretisation

One route to achieve second-order time accuracy with FS schemes is to use mass matrices that couple the time derivatives of neighboring grid points. This leads to implicit schemes, even if the spatial residual were treated explicitly. Although a more general framework for the derivation of the mass matrices can be devised [10], the approach adopted in our study consists of formulating the FS scheme as a Petrov-Galerkin FE method with elemental weighting function given by

The contribution of element

where the chain rule is used to make the dependent variable

In Eq. (10) the index

The matrix * ^{th}* row and j

*column of the global mass matrix*

^{th}*i*. Second-order space and time accuracy of the scheme described above has been demonstrated for inviscid flow problems by Campobasso et al. [11] using an exact solution of the Euler equations.

### 3.3. Numerical integration

Writing down the space- and time-discretized form of Eq. (1) for all gridpoints of the mesh, one obtains the following large, sparse system of non-linear algebraic equations

to be solved at time level

in pseudo-time

is used to approximate the pseudo-time derivative in the LHS of Eq. (13). The outer iterations counter

Upon replacing Eq. (14) in Eq. (13), an implicit scheme is obtained if the residual _{,} one obtains the following sparse system of linear equations

to be solved at each outer iteration until the required convergence of

The situation is different when dealing with steady flow problems. Newton’s method is only locally convergent, meaning that it is guaranteed to converge to a solution when the initial approximation is already close enough to the sought solution. This is generally not the case when dealing with steady flows, and a “globalization strategy” needs to be used in order to avoid stall or divergence of the outer iterations. The choice commonly adopted by various authors [13, 14], and in this study as well, is a pseudo-transient continuation, which amounts to retain the pseudo-transient term in Eq. (15). At the early stages of the iterative process, the pseudo-time step length

where

In the early stages of the iterative process, the turbulent transport equation and the mean flow equations are solved in tandem (or in a loosely coupled manner, following the nomenclature used by Zingg et al. [16]): the mean flow solution is advanced over a single pseudo-time step using an analytically computed, but approximate Jacobian while keeping turbulent viscosity frozen, then the turbulent variable is advanced over one or more pseudo-time steps using a FD Jacobian with frozen mean flow variables. Due to the uncoupling between the mean flow and turbulent transport equations, this procedure will eventually converge to steady state, but never yields quadratic convergence. Close to steady state, when a true Newton strategy preceded by a “short” pseudo-transient continuation phase can be adopted, the mean flow and the turbulence transport equation are solved in fully coupled form, and the Jacobian is computed by FD. For the sake of completeness, we give further details of each of these two steps in the following two paragraphs.

#### 3.3.1. Tandem solution strategy with (approximate) Picard linearization

Consider re-writing the steady nodal residual

where

which amounts to neglect the dependence of matrices

Once the mean flow solution has been advanced over a single pseudo-time step using the approximate Picard linearization, keeping the turbulent viscosity frozen, the turbulent variable is advanced over one or more (typically ten) pseudo-time steps using a FD Jacobian approximation (described in Section 3.3.2) with frozen mean flow variables. Blanco and Zingg [18] adopt a similar strategy, but keep iterating the turbulence transport equation until its residual has become lower than that of the mean flow equations. The loosely coupled solution strategy is a choice often made as it “allows for the easy interchange of new turbulence models” [19] and also reduces the storage [18], compared to a fully coupled approach. However, due to the uncoupling between the mean flow and the turbulent transport equations, the tandem solution strategy never yields quadratic convergence nor it is always able to drive the nodal residual to machine zero. The last statement cannot be generalized, since Blanco and Zingg [16, 20] report convergence to machine zero for their loosely coupled approach on two-dimensional unstructured grids. However, even if convergence to machine zero is difficult to achieve, the nodal residual is always sufficiently converged for any practical “engineering” purpose and close enough to “true” steady-state solution to be a good initial guess for Newton’s method.

#### 3.3.2. Fully coupled solution strategy with FD Newton linearization

Once the tandem solution strategy has provided a good approximation to the steady flow, or when dealing with unsteady flows, in which case the solution at a given time level is generally a good approximation to the one sought at the next time level, it becomes very attractive to take advantage of the quadratic convergence of Newton’s method. In order to do so, however, the mean flow and the turbulence transport equations must be solved fully coupled and the Jacobian matrix

When the equations are fully coupled, the structure of the Jacobian matrix

Eq. (19) follows by applying the sum rule of differentiation and by observing that the nodal residual itself is a sum of contributions from the elements that share vertex **Figure 1(b)**. Specifically, element

where *e*, which include both

where

From a coding viewpoint, the same loop over all cells used to build the nodal residual *i*) perturb each of the *ii*) evaluate the residual contribution to each of the vertices; *iii*) calculate the Jacobian entries according to Eq. (20). While looping over cell *e*, this contributes

The compactness of the FS stencil, which never extends beyond the set of distance-1 neighbors even for a second-order-accurate space-time discretization, offers two advantages. On one hand, apart from the truncation and round-off errors involved in the FD derivatives, the numerical Jacobian matrix is a close approximation of the analytical Jacobian, even for a second-order-accurate discretization. This feature is crucial for retaining the quadratic convergence properties of Newton’s algorithm. On the other hand, it is considerably sparser than that obtained using more traditional FV discretizations, which typically extend up to distance-2 [1] or even distance-3 neighbors [8]. In these latter cases, contributions from the outermost gridpoints in the stencil have to be neglected [1], or at least lumped [8], when constructing the Jacobian approximation upon which the ILU(

## 4. Linear solve and preconditioning

The previous discussions have pointed that the solution of the large nonsymmetric sparse linear system (15) at each pseudo-time step is a major computational task of the whole flow simulation, especially when the mean flow and the turbulence transport equations are solved in fully coupled form, the Jacobian is computed *exactly* by means of FD, and the size of the time-step is rapidly increased to recover Newton’s algorithm. For convenience, we write system (15) in compact form as

where _{,} and *preconditioned system* writes in the form *preconditioner matrix*. In the coming sections, we describe the development of an effective algebraic preconditioner for the RANS model.

### 4.1. Multi-elimination ILU factorization preconditioner

Incomplete LU factorization methods (ILUs) are an effective, yet simple, class of preconditioning techniques for solving large linear systems. They write in the form *U* factors according to various strategies, see [3]. A stable ILU factorization is proved to exist for arbitrary choices of the sparsity pattern of

Multi-elimination ILU factorization is a powerful class of ILU preconditioners, which combines the simplicity of ILU techniques with the robustness and high degree of parallelism of domain decomposition methods [34]. It is developed on the idea that, due to sparsity, many unknowns of a linear system are not coupled by an equation (i.e. they are *independent*) and thus they can be eliminated simultaneously at a given stage of GE. If the, say

where

where

The factorization (25) defines a general framework which may accommodate for many different methods. An important distinction between various methods is rooted in the choice of the algorithm used to discover sets of independent unknowns. Many of these algorithms are borrowed from graph theory, where such sets are referred to as *independent sets*. Denoting as *vertex independent set* S is defined as a subset of

The set

The multilevel preconditioner considered in our study is the Algebraic Recursive Multilevel Solvers (ARMS) introduced in [25], which uses block independent sets computed by the simple greedy algorithm. Block independent sets are characterized by the property that unknowns of two different sets have no coupling, while unknowns within the same set may be coupled. In this case, the matrix

### 4.2. The variable-block ARMS factorization

The discretization of the Navier-Stokes equations for turbulent compressible flows assigns five distinct variables to each grid point (density, scaled energy, two components of the scaled velocity, and turbulence transport variable); these reduce to four for incompressible, constant density flows, and to three if additionally the flow is laminar. If the, say

*Memory.*A clear advantage is to store the matrix as a collection of blocks using the variable-block compressed sparse row (VBCSR) format, saving column indices and pointers for the block entries.*Stability.*On indefinite problems, computing with blocks instead of single elements enables a better control of pivot breakdowns, near singularities, and other possible sources of numerical instabilities. Block ILU solvers may be used instead of pointwise ILU methods.*Complexity.*Grouping variables in clusters, the Schur complement is smaller and hopefully the last reduced system is better conditioned and easier to solve.*Efficiency.*A full block implementation, based on higher level optimized BLAS as computational kernels, may be designed leading to better flops to memory ratios on modern cache-based computer architectures.*Cache effects.*Better cache reuse is possible for block algorithms.

It has been demonstrated that block iterative methods often exhibit faster convergence rate than their pointwise analogues for the solution of many classes of two- and three-dimensional partial differential equations (PDEs) [44, 45, 46]. For this reason, in the case of the simple Poisson’s equation with Dirichlet boundary conditions on a rectangle

The PArameterized BLock Ordering (PABLO) method proposed by O’Neil and Szyld is one of the first block reordering algorithms for sparse matrices [47]. The algorithm selects groups of nodes in the adjacency graph of the coefficient matrix such that the corresponding diagonal blocks are either full or very dense. It has been shown that classical block stationary iterative methods such as block Gauss-Seidel and SOR methods combined with the PABLO ordering require fewer operations than their point analogues for the finite element discretization of a Dirichlet problem on a graded L-shaped region, as well as on the 9-point discretization of the Laplacian operator on a square grid. The complexity of the PABLO algorithm is proportional to the number of nodes and edges in both time and space.

Another useful approach to compute dense blocks in the sparsity pattern of a matrix *indistinguishable nodes* or *cliques*. The algorithm assigns a *checksum* quantity to each vertex, using the function

and then sorts the vertices by their checksums. This operation takes

On the other hand, sparse unstructured matrices may sometimes exhibit *approximate dense blocks* consisting mostly of non-zero entries, except for a few zeros inside the blocks. By treating these few zeros as non-zero elements, with a little sacrifice of memory, a block ordering may be generated for an iterative solver. Approximate dense blocks in a matrix may be computed by numbering consecutively rows and columns having a similar non-zero structure. However, this would require a new checksum function that preserves the proximity of patterns, in the sense that close patterns would result in close checksum values. Unfortunately, this property does not hold true for Ashcraft’s algorithm in its original form. In [49], Saad proposed to compare angles of rows (or columns) to compute approximate dense structures in a matrix *average block density* (*average block size* (

Our recently developed variable-block variant of the ARMS method (VBARMS) incorporates an angle-based compression technique during the factorization to detect fine-grained dense structures in the linear system automatically, without any users knowledge of the underlying problem, and exploits them to improve the overall robustness and throughput of the basic multilevel algorithm [50]. It is simpler to describe VBARMS from a graph point of view. Suppose to permute

where the diagonal blocks

An edge connects two supervertices

The complete pre-processing and factorization process of VBARMS consists of the following steps.

**Step 1.** Find the block ordering

**Step 2.** Scale the matrix at Step 1 in the form

**Step 3.** Find the block independent sets ordering

We use a simple form of weighted greedy algorithm for computing the ordering

In the

**Step 4.** Factorize the matrix (30) in the form

where

The Schur complement is also block sparse and has the same block partitioning of

Steps 2–4 can be repeated on the reduced system a few times until the Schur complement is small enough. After one additional level, we obtain

that can be factored as

Denote as

needs to be solved, with

Calling

the unknown solution vector and the right-hand side vector of system (35), the solution process with the above multilevel VBARMS factorization consists of level-by-level forward elimination followed by an exact solution on the last reduced system and suitable inverse permutation. The solving phase is sketched in Algorithm 1.

In VBARMS, we perform the factorization approximately, for memory efficiency. We use block ILU factorization with threshold to invert inexactly both the upper leftmost matrix

The VBARMS code is developed in the C language and is adapted from the existing ARMS code available in the ITSOL package [54]. The compressed sparse storage format of ARMS is modified to store block vectors and block matrices of variable size as a collection of contiguous non-zero dense blocks (we refer to this data storage format as VBCSR). First, we compute the factors

**Algorithm 1** `VBARMS_Solve`(

**Require:**

1: Solve

2: Compute

3: **if** **then**

4: Solve

5: else

6: Call `VBARMS_Solve`(

7: **end if**

8: Solve

## 5. Numerical experiments

In this section, we illustrate the performance of the VBARMS method for solving a suite of block structured linear systems arising from an implicit Newton-Krylov formulation of the RANS equations in the turbulent incompressible flow analysis past a three-dimensional wing. On multicore machines, the quotient graph *interior nodes*, that are those coupled by the equations only with the local variables, and *interface nodes*, those that may be coupled with the local variables stored on processor

The vector of the local unknowns

The rows of

or, in expanded form, as

where _{,} and

At this stage, the VBARMS method described in Section 4.2 can be used as a local solver for different types of global preconditioners. In the simplest parallel implementation, the so-called block-Jacobi preconditioner, the sequential VBARMS method can be applied to invert approximately each local matrix

where

If the diagonal blocks of the matrix

where

The global preconditioning matrix

and named as the Restricted Additive Schwarz (RAS) preconditioner [3, 57]. Note that the preconditioning step still offers a good scope par parallelism, as the different components of the error update are formed independently. However, due to overlapping some communication is required in the final update, as the components are added up from each subdomain. In our experiments, the overlap used for RAS was the level 1 neighbors of the local nodes in the quotient graph.

A third global preconditioner that we consider in this study is based on the Schur complement approach. In Eq. (40), we can eliminate the vector of interior unknowns

where

The local Schur complement equations considered altogether write as the global Schur complement system

where the off-diagonal matrices

at the cost of one local solve. We solve the global system (47) by running a few steps of the GMRES method preconditioned by a block diagonal matrix, where the diagonal blocks are the local Schur complements

is obtained as by-product of the LU factorization of the local matrix

which is by the way required to compute the

### 5.1. Results

The parallel experiments were run on the large-memory nodes (32 cores/node and 1 TB of memory) of the TACC Stampede system located at the University of Texas at Austin. TACC Stampede is a 10 PFLOPS (PF) Dell Linux Cluster based on 6400+ Dell PowerEdge server nodes, each outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi Coprocessor (MIC Architecture). We linked the default vendor BLAS library, which is MKL. Although MKL is multi-threaded by default, in our runs we used it in a single-thread mode since our MPI-based parallelisation employed one MPI process per core (communicating via the shared memory for the same-node cores). We used the Flexible GMRES (FGMRES) method [58] as Krylov subspace method, a tolerance of

In our experiments, we analyzed the turbulent incompressible flow past a three-dimensional wing illustrated in **Figure 2** using the *EulFS* code developed by the second author [59]. The geometry, called DPW3 Wing-1, was proposed in the 3rd AIAA Drag Prediction Workshop [35]. Flow conditions are 0.5**Tables 1** and **2** we show experiments with the parallel VBARMS solver on the five meshes of the DPW3 Wing-1 problem. On the largest mesh we report on only one experiment, in **Table 2**, as this is a resource demanding problem. In **Table 3** we report on a strong scalability study on the problem denoted as RANS2 by increasing the number of processors. Finally, in **Table 4** we show comparative results with parallel VBARMS against other popular solvers; the method denoted as pARMS is the solver described in [55] using default parameters while the method VBILUT is a variable-block incomplete lower-upper factorization with threshold from the ITSOL package [54]. The results of our experiments show that the proposed preconditioner is effective to reduce the number of iterations especially in combination with the Restricted Additive Schwarz method, and exhibits good parallel scalability. A truly parallel implementation of the VBARMS method that may offer better numerical scalability will be considered as the next step of this research.

## 6. Conclusions

The applicability of Newton’s method in steady flow simulations is often limited by the difficulty to compute a good initial solution, namely, one lying in a reasonably small neighborhood of the sought solution. This problem can now be overcome by introducing some approximations in the first stages of the solution procedure. In the case of unsteady flow problems, on the other hand, the use of Newton’s method in conjunction with a dual-time stepping procedure is even more effective since the flow field computed at the preceding physical time level is likely to be sufficiently close to the sought solution at the next time level to allow the use of Newton’s algorithm right from the beginning of the sub-iterations in pseudo-time. On the downside of Newton-Krylov methods is the need for efficiently preconditioned iterative algorithms to solve the sparse linear system arising at each inner iteration (Newton step). The stiffness of the linear systems to be solved increases when the Jacobian is computed “exactly” and the turbulence transport equations are solved fully coupled with the mean flow equations.

In this chapter, we have presented a block multilevel incomplete factorization preconditioner for solving sparse systems of linear equations arising from the implicit RANS formulation. The method detects automatically any existing block structure in the matrix, without any user’s prior knowledge of the underlying problem, and exploits it to maximize computational efficiency. The results of this chapter show that, by taking advantage of this block structure, the solver can be more robust and efficient. Other recent studies on block ILU preconditioners have drawn similar conclusions on the importance of exposing dense blocks during the construction of the incomplete LU factorization for better performance, in the design of incomplete multifrontal LU-factorization preconditioners [61] and adaptive blocking approaches for blocked incomplete Cholesky factorization [62]. We believe that the proposed VBARMS method can be useful for solving linear systems also in other areas, such as in Electromagnetics applications [63, 64, 65].