Multilevel Variable-Block Schur-Complement-Based Preconditioning for the Implicit Solution of the Reynolds-Averaged Navier-Stokes Equations Using Unstructured Grids Multilevel Variable-Block Schur-Complement-Based Preconditioning for the Implicit Solution of the Reynolds-Averaged Navier-Stokes Equations Using Unstructured Grids

Implicit methods based on the Newton ’ s rootfinding algorithm are receiving an increas- ing attention for the solution of complex Computational Fluid Dynamics (CFD) applications due to their potential to converge in a very small number of iterations. This approach requires fast convergence acceleration techniques in order to compete with other conventional solvers, such as those based on artificial dissipation or upwind schemes, in terms of CPU time. In this chapter, we describe a multilevel variable-block Schur-complement-based preconditioning for the implicit solution of the Reynolds- averaged Navier-Stokes equations using unstructured grids on distributed-memory parallel computers. The proposed solver detects automatically exact or approximate dense structures in the linear system arising from the discretization, and exploits this information to enhance the robustness and improve the scalability of the block factori- zation. A complete study of the numerical and parallel performance of the solver is presented for the analysis of turbulent Navier-Stokes equations on a suite of three- dimensional test cases.


Introduction
A considerable number of modern high-fidelity Computational Fluid Dynamics (CFD) solvers and codes still adopt either one-dimensional physical models based on the Riemann problem using higher order shape functions, such as higher order Finite Volume (FV) and Discontinuous Galerkin Finite Element (FE) methods for the discrete data representation, or truly multidimensional physical models using linear shape functions, like Fluctuation Splitting (FS) schemes. Both of these approaches require fast convergence acceleration techniques in order to compete with conventional solvers based on artificial dissipation or upwind schemes in terms of CPU time. Implicit methods based on the Newton's rootfinding algorithm are receiving an increasing attention in this context for the solution of complex real-world CFD applications, for example in the analyses of turbulent flows past three-dimensional wings, due to their potential to converge in a very small number of iterations [1,2]. In this chapter, we consider convergence acceleration strategies for the implicit solution of the Reynolds-averaged Navier-Stokes (RANS) equations based on the FS space discretization using a preconditioned Newton-Krylov algorithm for the integration. The use of a Newton solver requires the inversion of a large nonsymmetric system of equations at each step of the non-linear solution process. Choice of linear solver and preconditioner is crucial for efficiency especially when the mean flow and the turbulence transport equation are solved in fully coupled form. In this study, we use the restarted Generalized Minimal Residuals (GMRES) [3] algorithm for the inner linear solver, preconditioned by a block multilevel incomplete lower-upper (LU) factorization. We present the development lines of the multilevel preconditioning strategy that is efficient to reduce the number of iterations of Krylov subspace methods at moderate memory cost, and shows good parallel performance on three-dimensional turbulent flow simulations.
The chapter is structured as follows. The governing conservation equations for both compressible and incompressible flows are reviewed in Section 2. Section 3 briefly describes the fluctuation splitting space discretization, the time discretization and the Newton-Krylov method used to solve the space-and time-discretized set of governing partial differential equations (PDEs). In Section 4, we present the development of the multilevel preconditioning strategies for the inner linear solver. We illustrate the numerical and parallel performance of the preconditioner for the analysis of turbulent incompressible flows past a three-dimensional wing in Section 5. Some concluding remarks arising from the study are presented in Section 6.

Governing equations
In the case of inviscid and laminar flows, given a control volume C i , fixed in space and bounded by the control surface ∂C i with inward normal n, the governing equations of fluid dynamics are obtained by considering the conservation of mass, momentum and energy. In the case of viscous turbulent flows, one approach to consider the effects of turbulence is to average the unsteady Navier-Stokes (NS) equations on the turbulence time scale. Such averaging procedure results in a new set of steady equations (the RANS equations) that differ from the steady NS equations for the presence of the Reynolds' stress tensor, representing the effects of turbulence on the averaged flow field. The appearance of this tensor yields a closure problem, which is often solved by adopting an algebraic or a differential turbulence model. In the present work, we use the Spalart-Allmaras [4] one-equation model for the turbulent viscosity. Thus the integral form of the conservation law of mass, momentum, energy and turbulence transport equations has the form where U is the vector of conserved variables. For compressible flows, we have U ¼ r; re 0 ; À ru;ν Á T , and for incompressible, constant density flows, U ¼ p; u;ν ð Þ T : The operators F and G represent the inviscid and viscous fluxes, respectively; for compressible flows, we have and for incompressible, constant density flows, Finally, the source term S has a non-zero entry only in the row corresponding to the turbulence transport equation; its expression is not reported here for brevity, but can be found in [4]. Note that the standard NS equations are retrieved from (1) by removing the source term S and the differential equation associated with the turbulence variable, and setting the effective viscosity and thermal conductivity to their laminar values. The Euler equations are instead recovered by additionally removing the flux vector G.

Solution techniques
The model used in this study for the discrete data representation is based on the coupling of an hybrid class of methods for the space discretization, called Fluctuation Splitting (or residual distribution) schemes [5], and a fully coupled Newton algorithm. By "fully coupled" we mean that the mass, momentum and energy conservation equations on one hand, and the turbulent equation on the other, are solved simultaneously rather than in a decoupled or staggered fashion. We discuss in the following subsections, separately, the space and time discretization, the numerical integration of the set of equations resulting from the discretization, and the solution of the large linear system at each Newton's step.

Space discretisation
The Fluctuation Splitting approach has features common to both Finite Element (FE) and Finite Volume (FV) methods. Like in standard FE methods, the dependent variables are stored at the vertices of the computational mesh made up of triangles in the two-dimensional (2D) space, and tetrahedra in three-dimensional (2D), and are assumed to vary linearly and continuously in space. Denoting Z i as the nodal value of the dependent variable at the grid point i and N i as the FE linear shape function, this dependence can be written as Note that, although the summation in Eq. (4) extends over all grid nodes, the computational molecule of each node is actually limited only to the set of its nearest neighbors due to the compact support of the linear shape functions. In the compressible case, Roe's parameter vector is chosen as the dependent variable to ensure discrete conservation [6]. In the incompressible case, discrete conservation is obtained by simply setting the dependent variable Z equal to the vector of conserved variables U. In our code, we group the dependent variables per gridpoint. The first m entries of the array Z are filled with the m flow variables of gridpoint 1, and these are followed by those of gridpoint 2, and so on. Blocking the flow variables in this way, also referred to as "field interlacing" in the literature, is acknowledged [7][8][9] to result in better performances than grouping variables per aerodynamic quantity.
The integral Eq. (1) is discretized over each control volume C i using a FV-type approach. In two dimensions, the control volumes C i are drawn around each gridpoint by joining the centroids of gravity of the surrounding cells with the midpoints of all the edges that connect that gridpoint with its nearest neighbors. An example of polygonal-shaped control volumes (so-called median dual cells) is shown by green lines in Figure 1(a). With FS schemes, rather than calculating the inviscid fluxes by numerical quadrature along the boundary ∂C i of the median dual cell, as would be done with conventional FV schemes, the net inviscid flux Φ e, inv over each triangular/tetrahedral element is evaluated by means of a conservative linearization based on the parameter vector [6], and scattered to the element vertices using elemental distribution matrices B e i [5]. The inviscid contribution to the nodal residual R Φ ð Þ i is then assembled by collecting fractions Φ e, inv i of the net inviscid fluxes Φ e, inv associated with all the elements by which the node i is surrounded. This is schematically shown in Figure 1(b). Concerning the viscous terms, the corresponding flux balance is evaluated by surface integration along the boundaries of the median dual cell: node i receives a contribution Φ e, vis i from cell e which accounts for the viscous flux through the portion of ∂C i that belongs to that cell. This approach can be shown to be equivalent to a Galerkin FE discretization.
Summing up the inviscid and viscous contributions to the nodal residual of gridpoint i one obtains In Eq. (7), the summation ranges over all the elements e that meet in meshpoint i, as shown in Figure 1(b). The construction of the distribution matrices B e i involves the solution of d þ 1 small (of order m) dense linear systems for each triangular/tetrahedral element of the mesh thus making FS schemes somewhat more expensive than state-of-the-art FV schemes based upon either central differencing with artificial dissipation or upwind discretizations. The relatively high computational cost of FS discretizations has to be accounted for when deciding whether the Jacobian matrix should be stored in memory or a Jacobian-free method be used instead.

Time discretisation
One route to achieve second-order time accuracy with FS schemes is to use mass matrices that couple the time derivatives of neighboring grid points. This leads to implicit schemes, even if the spatial residual were treated explicitly. Although a more general framework for the derivation of the mass matrices can be devised [10], the approach adopted in our study consists of formulating the FS scheme as a Petrov-Galerkin FE method with elemental weighting function given by The contribution of element e to the weighted residual equation for grid point i reads ð Te where the chain rule is used to make the dependent variable Z appear. Since the conservative variables U are quadratic functions of the parameter vector Z, the transformation matrix ∂U=∂Z is linear in Z and can thus be expanded using the linear shape functions N j , just as in Eq. (4). A similar expansion applies to the time-derivative ∂Z=∂t. Replacing both expansions in the RHS of Eq. (9), the discrete counterpart of the time derivative of Eq. (9) is given by the contribution of all elements sharing the node i: ð In Eq. (10) the index j spans the vertices of the element e and the nodal values ∂Z=∂t ð Þ j are approximated by the three-level Finite Difference (FD) formula The matrix M e ij in Eq. (10) is the contribution of element e to the entry in the i th row and j th column of the global mass matrix M ½ . Similarly to what is done in the assembly of the inviscid and viscous flux balance, Eq. (7), the discretization of the unsteady term, Eq. (10), is obtained by collecting elemental contributions from all the elements that surround the node i. Secondorder space and time accuracy of the scheme described above has been demonstrated for inviscid flow problems by Campobasso et al. [11] using an exact solution of the Euler equations.

Numerical integration
Writing down the space-and time-discretized form of Eq. (1) for all gridpoints of the mesh, one obtains the following large, sparse system of non-linear algebraic equations to be solved at time level n þ 1 to obtain the unknown solution vector U nþ1 . The solution of Eq. (12) is obtained by means of an implicit approach based on the use of a fictitious timederivative (Jameson's dual time-stepping [12]) that amounts to solve the following evolutionary problem in pseudo-time τ until steady state is reached. Since accuracy in pseudo-time is obviously irrelevant, the mass matrix has been lumped into the diagonal matrix V M ½ and a first-order accurate, two-time levels FD formula is used to approximate the pseudo-time derivative in the LHS of Eq. (13). The outer iterations counter k has been introduced in Eq. (14) to label the pseudo-time levels.
Upon replacing Eq. (14) in Eq. (13), an implicit scheme is obtained if the residual R g is evaluated at the unknown pseudo-time level k þ 1. Taylor expanding R g about time level k , one obtains the following sparse system of linear equations to be solved at each outer iteration until the required convergence of R g is obtained. Steady RANS simulations are accommodated within the presented integration scheme by dropping the physical time-derivative term in Eq. (12). In the limit Δτ k ! ∞, Eq. (15) recovers Newton's rootfinding algorithm, which is known to yield quadratic convergence when the initial guess U nþ1, 0 ¼ U n is sufficiently close to the sought solution U nþ1 . This is likely to occur when dealing with unsteady flow problems because the solution of the flow field at a given physical time starts from the converged solution at the preceding time, and this latter constitutes a very convenient initial state. In fact, it is sufficiently close to the sought new solution to allow the use of the exact Newton's method (i.e. Δτ k ¼ ∞ in Eq. (15)) since the first solution step.
The situation is different when dealing with steady flow problems. Newton's method is only locally convergent, meaning that it is guaranteed to converge to a solution when the initial approximation is already close enough to the sought solution. This is generally not the case when dealing with steady flows, and a "globalization strategy" needs to be used in order to avoid stall or divergence of the outer iterations. The choice commonly adopted by various authors [13,14], and in this study as well, is a pseudo-transient continuation, which amounts to retain the pseudo-transient term in Eq. (15). At the early stages of the iterative process, the pseudo-time step length Δτ k in Eq. (15) is kept small. The advantage is twofold: on one hand, it helps preventing stall or divergence of the outer iterations; on the other hand, it makes the linear system (15) easier to solve by means of an iterative solver since for moderate values of Δτ k the term V m ½ =Δτ k increases the diagonal dominance of the matrix. Once the solution has come close to the steady state, which can be monitored by looking at the norm of the nodal residual R Φ , we let Δτ k grow unboundedly so that Newton's method is eventually recovered during the last steps of the iterative process. The time step length Δτ k is selected according to the Switched Evolution Relaxation (SER) strategy proposed by Mulder and van Leer [15], as follows: where Δτ is the pseudo-time step based upon the stability criterion of the explicit time integration scheme, and C 0 and C max are user-defined constants controlling the initial and maximum pseudo-time steps used in the actual calculations.
In the early stages of the iterative process, the turbulent transport equation and the mean flow equations are solved in tandem (or in a loosely coupled manner, following the nomenclature used by Zingg et al. [16]): the mean flow solution is advanced over a single pseudo-time step using an analytically computed, but approximate Jacobian while keeping turbulent viscosity frozen, then the turbulent variable is advanced over one or more pseudo-time steps using a FD Jacobian with frozen mean flow variables. Due to the uncoupling between the mean flow and turbulent transport equations, this procedure will eventually converge to steady state, but never yields quadratic convergence. Close to steady state, when a true Newton strategy preceded by a "short" pseudo-transient continuation phase can be adopted, the mean flow and the turbulence transport equation are solved in fully coupled form, and the Jacobian is computed by FD. For the sake of completeness, we give further details of each of these two steps in the following two paragraphs.

Tandem solution strategy with (approximate) Picard linearization
Consider re-writing the steady nodal residual R Φ , see [17] for full details, as where C ½ and D ½ are (sparse) matrices that account for the convective and diffusive contributions to the nodal residual vector R Φ . Matrix D ½ is constant for isothermal, incompressible flows whereas it depends upon the flow variables through molecular viscosity in the case of compressible flows. Matrix C ½ depends upon U for both compressible and incompressible flows. Both matrices can be computed analytically as described in [17]. What we refer to as a Picard linearization consists in the following approximation which amounts to neglect the dependence of matrices C ½ and D ½ upon U when differentiating the residual, written as in Eq. (17).
Once the mean flow solution has been advanced over a single pseudo-time step using the approximate Picard linearization, keeping the turbulent viscosity frozen, the turbulent variable is advanced over one or more (typically ten) pseudo-time steps using a FD Jacobian approximation (described in Section 3.3.2) with frozen mean flow variables. Blanco and Zingg [18] adopt a similar strategy, but keep iterating the turbulence transport equation until its residual has become lower than that of the mean flow equations. The loosely coupled solution strategy is a choice often made as it "allows for the easy interchange of new turbulence models" [19] and also reduces the storage [18], compared to a fully coupled approach. However, due to the uncoupling between the mean flow and the turbulent transport equations, the tandem solution strategy never yields quadratic convergence nor it is always able to drive the nodal residual to machine zero. The last statement cannot be generalized, since Blanco and Zingg [16,20] report convergence to machine zero for their loosely coupled approach on two-dimensional unstructured grids. However, even if convergence to machine zero is difficult to achieve, the nodal residual is always sufficiently converged for any practical "engineering" purpose and close enough to "true" steady-state solution to be a good initial guess for Newton's method.

Fully coupled solution strategy with FD Newton linearization
Once the tandem solution strategy has provided a good approximation to the steady flow, or when dealing with unsteady flows, in which case the solution at a given time level is generally a good approximation to the one sought at the next time level, it becomes very attractive to take advantage of the quadratic convergence of Newton's method. In order to do so, however, the mean flow and the turbulence transport equations must be solved fully coupled and the Jacobian matrix J ½ must be accurate. We take advantage of the compactness of the computational stencil required by FS schemes to compute a close approximation to the true Jacobian matrix even for second-order accurate discretizations. The analytical evaluation of the Jacobian matrix, though not impossible [5,21], is rather cumbersome and thus this approach is not pursued here.
When the equations are fully coupled, the structure of the Jacobian matrix J ½ is naturally organized into small dense blocks of order m. This has implications both in terms of storage, since it is possible to reduce the length of the integer pointers that define the Compressed Sparse Row (CSR) data structure of the sparse matrix, and also in the design of the preconditioner for solving the large linear system at each Newton step, where division operations can be efficiently replaced by block matrix factorizations. We will address these issues in detail in the next section. Two neighboring gridpoints, i and j, in the mesh will contribute two block entries, J ij and J ji , to the global Jacobian matrix J ½ . Each of these two block entries (say J ij , for instance) will be computed by assembling elemental contributions coming from all the cells that share vertex i, as follows Eq. (19) follows by applying the sum rule of differentiation and by observing that the nodal residual itself is a sum of contributions from the elements that share vertex i, see Figure 1(b). Specifically, element J e ij accounts for the contribution of cell e to the residual change at gridpoint i, due to a change in the state vector of a neighboring gridpoint j that belongs to the same element e. The contribution of cell e to the element p; q ð Þ of the block J ij is computed from the following one-sided FD formula where R e where ε is a "small" quantity. Due to the use of a one-sided FD formula, the FD approximation (20) of the Jacobian entry is affected by a truncation error which is proportional to the first power of ε. Small values of ε keep the truncation error small, but too small values may lead to round-off errors. Following [21], ε is computed as From a coding viewpoint, the same loop over all cells used to build the nodal residual R g is also used to assemble matrix J ½ . The operations to be performed within each cell are the following: i) perturb each of the m components of the conserved variables vector of the d þ 1 vertices of cell e; ii) evaluate the residual contribution to each of the vertices; iii) calculate the Jacobian entries according to Eq. (20). While looping over cell e, this contributes d þ 1 ð Þ 2 block entries to the global matrix J ½ . Moreover, it follows that the cost of a Jacobian evaluation is equal to m Â d þ 1 ð Þresidual evaluations, which can be quite a large number. For instance, for a 3D compressible RANS calculation using a one-equation turbulence model, m Â d þ 1 ð Þ¼24. In this study, it was decided to store the Jacobian matrix in memory rather than using a Jacobian-free (JFNK), as the Jacobian matrix is relatively sparse even for a second-order accurate discretization due to the compactness of the stencil. The JFNK approach avoids assembling and, more important, storing the Jacobian matrix. However, the matrix-vector product is replaced with the Jacobian matrix by FD formulae which requires extra costly FS residual evaluations. Note that the JFNK method still requires the construction of a preconditioner to be used by the iterative linear solver, often an Incomplete Lower Upper factorization, which is typically constructed using a lower order approximation of the residual vector. Matrix-free preconditioners might also be used [22], saving a huge storage at the expense of extra CPU cost. These latter are referred to as MFNK methods. Although JFNK or MFNK approaches should certainly be favored from the viewpoint of memory occupation, it cannot always be "assumed that the Jacobian-free matrix-vector products are inherently advantageous in terms of computing time" [13].
The compactness of the FS stencil, which never extends beyond the set of distance-1 neighbors even for a second-order-accurate space-time discretization, offers two advantages. On one hand, apart from the truncation and round-off errors involved in the FD derivatives, the numerical Jacobian matrix is a close approximation of the analytical Jacobian, even for a second-order-accurate discretization. This feature is crucial for retaining the quadratic convergence properties of Newton's algorithm. On the other hand, it is considerably sparser than that obtained using more traditional FV discretizations, which typically extend up to distance-2 [1] or even distance-3 neighbors [8]. In these latter cases, contributions from the outermost gridpoints in the stencil have to be neglected [1], or at least lumped [8], when constructing the Jacobian approximation upon which the ILU(ℓ) preconditioner is built. These approximations are a potential source of performance degradation as reported in [8]. The memory occupation required to store the Jacobian matrix still remains remarkable. Moreover, not only the Jacobian matrix, but also its preconditioner needs to be stored in the computer memory. It is therefore clear that a key ingredient that would help reducing memory occupation is an effective preconditioner having a relatively small number of non-zero entries, as close as possible to that of the Jacobian matrix. This demanding problem is addressed in the next section.

Linear solve and preconditioning
The previous discussions have pointed that the solution of the large nonsymmetric sparse linear system (15) at each pseudo-time step is a major computational task of the whole flow simulation, especially when the mean flow and the turbulence transport equations are solved in fully coupled form, the Jacobian is computed exactly by means of FD, and the size of the time-step is rapidly increased to recover Newton's algorithm. For convenience, we write system (15) in compact form as where A ¼ a ij Â Ã is the large and sparse coefficient matrix of, say, size n , and b is the right-hand side vector. It is well established that, when A is highly nonsymmetric and/or indefinite, iterative methods need the assistance of preconditioning to transform system (23) into an equivalent one that is more amenable to an iterative solver. The transformed preconditioned system writes in the form M À1 Ax ¼ M À1 b when preconditioning is applied from the left, or AM À1 y ¼ b with x ¼ M À1 y when preconditioning is applied from the right. The matrix M is a nonsingular approximation to A called the preconditioner matrix. In the coming sections, we describe the development of an effective algebraic preconditioner for the RANS model.

Multi-elimination ILU factorization preconditioner
Incomplete LU factorization methods (ILUs) are an effective, yet simple, class of preconditioning techniques for solving large linear systems. They write in the form M ¼ L U, where L and U are approximations of the L and U factors of the standard triangular LU decomposition of A. The incomplete factorization may be computed directly from the Gaussian Elimination (GE) algorithm, by discarding some entries in the L and U factors according to various strategies, see [3]. A stable ILU factorization is proved to exist for arbitrary choices of the sparsity pattern of L and U only for particular classes of matrices, such as M-matrices [23] and H-matrices with positive diagonal entries [24]. However, many techniques can help improve the quality of the preconditioner on more general problems, such as reordering, scaling, diagonal shifting, pivoting and condition estimators [25][26][27][28]. As a result of this recent development, in the past decade successful experience have been reported using ILU preconditioners in areas that were of exclusive domain of direct solution methods like, in circuits simulation, power system networks, chemical engineering plants modeling, graphs and other problems not governed by PDEs, or in areas where direct methods have been traditionally preferred, such as structural analysis, semiconductor device modeling, computational fluid dynamics (see [29][30][31][32][33]).
Multi-elimination ILU factorization is a powerful class of ILU preconditioners, which combines the simplicity of ILU techniques with the robustness and high degree of parallelism of domain decomposition methods [34]. It is developed on the idea that, due to sparsity, many unknowns of a linear system are not coupled by an equation (i.e. they are independent) and thus they can be eliminated simultaneously at a given stage of GE. If the, say m, independent unknowns are numbered first, and the other n À m unknowns last, the coefficient matrix of the system is permuted in a 2Â2 block structure of the form where D is a diagonal matrix of dimension m and C is a square matrix of dimension n À m. In multi-elimination methods, a reduced system is recursively constructed from (24) by computing a block LU factorization of PAP T of the form where L and U are the triangular factors of the LU factorization of D, A 1 ¼ C À ED À1 F is the Schur complement with respect to C, I nÀm is the identity matrix of dimension n À m, and we denote G ¼ EU À1 and W ¼ L À1 F. The reduction process can be applied another time to the reduced system with A 1 , and recursively to each consecutively reduced system until the Schur complement is small enough to be solved with a standard method such as a dense LAPACK solver [35]. Multi-elimination ILU factorization preconditioners may be obtained from the decomposition (25) by performing the reduction process inexactly, by dropping small entries in the Schur complement matrix and/or factorizing D approximately at each reduction step. These preconditioners exhibit better parallelism than conventional ILU algorithms, due to the recursive factorization. Additionally, for comparable memory usage, they may be significantly more robust especially for solving large problems as the reduced system is typically small and better conditioned compared to the full system.
The factorization (25) defines a general framework which may accommodate for many different methods. An important distinction between various methods is rooted in the choice of the algorithm used to discover sets of independent unknowns. Many of these algorithms are borrowed from graph theory, where such sets are referred to as independent sets. Denoting as ; v n f gis the set of vertices and E the set of edges, a vertex independent set S is defined as a subset of V such that The set S is maximal if there is no other independent set containing S strictly [36]. Independent sets in a graph may be computed by simple greedy algorithms which traverse the vertices in the natural order 1, 2, …, n, mark each visited vertex v and all of its nearest neighbors connected to v by an edge, and add v and each visited node that is not already marked to the independent set [37]. As an alternative to the greedy algorithm, the nested dissection ordering [38], mesh partitioning, or further information from the set of nested finite element grids of the underlying problem can be used [39][40][41].
The multilevel preconditioner considered in our study is the Algebraic Recursive Multilevel Solvers (ARMS) introduced in [25], which uses block independent sets computed by the simple greedy algorithm. Block independent sets are characterized by the property that unknowns of two different sets have no coupling, while unknowns within the same set may be coupled. In this case, the matrix D appearing in (24) is block diagonal, and may typically consist of large-sized diagonal blocks that are factorized by an ILU factorization with threshold (ILUT [42]) for memory efficiency. In the ARMS implementation described in [25], first the incomplete triangular factors L, U of D are computed by one sweep of ILUT, and an approximation W to L À1 F is also computed.
In a second loop, an approximation G to EU À1 and an approximate Schur complement matrix A 1 are derived. This holds at each reduction level. At the last level, another sweep of ILUT is applied to the (last) reduced system. The blocks W and G are stored temporarily, and then discarded from the data structure after the Schur complement matrix is computed. Only the incomplete factors of D at each level, those of the last level Schur matrix, and the permutation arrays are needed for the solving phase. By this implementation, dropping can be performed separately in the matrices L, U, W, G, A 1 . This in turns allows to factor D accurately without incurring additional costs in G and W, achieving high computational and memory efficiency. Implementation details and careful selection of the parameters are always critical aspects to consider in the design of sparse matrix algorithms. Next, we show how to combine the ARMS method with matrix compression techniques to exploit the block structure of A for better efficiency.

The variable-block ARMS factorization
The discretization of the Navier-Stokes equations for turbulent compressible flows assigns five distinct variables to each grid point (density, scaled energy, two components of the scaled velocity, and turbulence transport variable); these reduce to four for incompressible, constant density flows, and to three if additionally the flow is laminar. If the, say ℓ, distinct variables associated with the same node are numbered consecutively, the permuted matrix has a sparse block structure with non-zero blocks of size ℓ Â ℓ. The blocks are usually fully dense, as variables at the same node are mutually coupled. Exploiting any available block structure in the preconditioner design may bring several benefits [43], some of them are explained below: 1. Memory. A clear advantage is to store the matrix as a collection of blocks using the variableblock compressed sparse row (VBCSR) format, saving column indices and pointers for the block entries.

2.
Stability. On indefinite problems, computing with blocks instead of single elements enables a better control of pivot breakdowns, near singularities, and other possible sources of numerical instabilities. Block ILU solvers may be used instead of pointwise ILU methods.

3.
Complexity. Grouping variables in clusters, the Schur complement is smaller and hopefully the last reduced system is better conditioned and easier to solve.

Efficiency.
A full block implementation, based on higher level optimized BLAS as computational kernels, may be designed leading to better flops to memory ratios on modern cache-based computer architectures.

Cache effects.
Better cache reuse is possible for block algorithms. It has been demonstrated that block iterative methods often exhibit faster convergence rate than their pointwise analogues for the solution of many classes of two-and three-dimensional partial differential equations (PDEs) [44][45][46]. For this reason, in the case of the simple Poisson's equation with Dirichlet boundary conditions on a rectangle 0; ℓ 1 ð ÞÂ 0; ℓ 2 ð Þdiscretized uniformly by using n 1 þ 2 points in the interval 0; ℓ 1 ð Þand n 2 þ 2 points in 0; ℓ 2 ð Þ, it is often convenient to number the interior points by lines from the bottom up in the natural ordering, so that one obtains a n 2 Â n 2 block tridiagonal matrix with square blocks of size n 1 Â n 1 ; the diagonal blocks are tridiagonal matrices and the off-diagonal blocks are diagonal matrices. For large finite element discretizations, it is common to use substructuring, where each substructure of the physical mesh corresponds to one sparse block of the system. If the domain is highly irregular or the matrix does not correspond to a differential equation, finding the best block partitioning is much less obvious. In this case, graph reordering techniques are worth considering.
The PArameterized BLock Ordering (PABLO) method proposed by O'Neil and Szyld is one of the first block reordering algorithms for sparse matrices [47]. The algorithm selects groups of nodes in the adjacency graph of the coefficient matrix such that the corresponding diagonal blocks are either full or very dense. It has been shown that classical block stationary iterative methods such as block Gauss-Seidel and SOR methods combined with the PABLO ordering require fewer operations than their point analogues for the finite element discretization of a Dirichlet problem on a graded L-shaped region, as well as on the 9-point discretization of the Laplacian operator on a square grid. The complexity of the PABLO algorithm is proportional to the number of nodes and edges in both time and space.
Another useful approach to compute dense blocks in the sparsity pattern of a matrix A is the method proposed by Ashcraft in [48]. The algorithm searches for sets of rows or columns having the exact same pattern. From a graph viewpoint, it looks for vertices of the adjacency graph V; E ð Þof A having the same adjacency list. These are also called indistinguishable nodes or cliques. The algorithm assigns a checksum quantity to each vertex, using the function and then sorts the vertices by their checksums. This operation takes |E| þ |V| log |V| time. If u and v are indistinguishable, then chk u ð Þ ¼ chk v ð Þ. Therefore, the algorithm examines nodes having the same checksum to see if they are indistinguishable. The ideal checksum function would assign a different value for each different row pattern that occurs but it is not practical because it may quickly lead to huge numbers that may not even be machine-representable. Since the time cost required by Ashcraft's method is generally negligible relative to the time it takes to solve the system, simple checksum functions such as (27) are used in practice [48].
On the other hand, sparse unstructured matrices may sometimes exhibit approximate dense blocks consisting mostly of non-zero entries, except for a few zeros inside the blocks. By treating these few zeros as non-zero elements, with a little sacrifice of memory, a block ordering may be generated for an iterative solver. Approximate dense blocks in a matrix may be computed by numbering consecutively rows and columns having a similar non-zero structure. However, this would require a new checksum function that preserves the proximity of patterns, in the sense that close patterns would result in close checksum values. Unfortunately, this property does not hold true for Ashcraft's algorithm in its original form. In [49], Saad proposed to compare angles of rows (or columns) to compute approximate dense structures in a matrix A. Let C be the pattern matrix of A, which by definition has the same pattern as A and the nonzero values are equal to 1. The method proposed by Saad computes the upper triangular part of CC T . Entry i; j ð Þ is the inner product (the cosine value) between row i and row j of C for j > i. A parameter τ is used to gauge the proximity of row patterns. If the cosine of the angle between rows i and j is smaller than τ, row j is added to the group of row i. For τ ¼ 1 the method will compute perfectly dense blocks, while for τ < 1 it may compute larger blocks where some zero entries are padded in the pattern. To speed up the search, it may be convenient to run a first pass with the checksum algorithm to detect rows having an identical pattern, and group them together; then, in a second pass, each non-assigned row is scanned again to determine whether it can be added to an existing group. Two important performance measures to gauge the quality of the block ordering computed are the average block density (av_bd) value, defined as the amount of non-zeros in the matrix divided by the amount of elements in the non-zero blocks, and the average block size (av_bs) value, which is the ratio between the sum of dimensions of the square diagonal blocks divided by the number of diagonal blocks. The cost of Saad's method is closer to that of checksum-based methods for cases in which a good blocking already exists, and in most cases it remains inferior to the cost of the least expensive block LU factorization, i.e. block ILU(0).
Our recently developed variable-block variant of the ARMS method (VBARMS) incorporates an angle-based compression technique during the factorization to detect fine-grained dense structures in the linear system automatically, without any users knowledge of the underlying problem, and exploits them to improve the overall robustness and throughput of the basic multilevel algorithm [50]. It is simpler to describe VBARMS from a graph point of view. Suppose to permute A in block form as where the diagonal blocksÃ ii , i ¼ 1, …, p are n i Â n i and the off-diagonal blocksÃ ij are n i Â n j .
We use upper case letters to denote matrix sub-blocks and lower case letters for individual matrix entries. We may represent the adjacency graph ofÃ by the quotient graph of A þ A T [36]. Calling B the partition into blocks given by (28), we denote as G=B ¼ V B ; E B f g the quotient graph obtained by coalescing the vertices assigned to the blockÃ ii (for i ¼ 1, …, p) into a supervertex Y i . In other words, the entry in position i; j ð Þ ofÃ is a block of dimension |Y i | Â |Y j |, where |X| is the cardinality of the set X. With this notation, the quotient graph An edge connects two supervertices Y i and Y j if there exists an edge from a vertex in A ii to a vertex in A jj in the graph V; E f gof A þ A T .
The complete pre-processing and factorization process of VBARMS consists of the following steps.
Step 1. Find the block ordering P B of A such that, upon permutation, the matrix P B AP T B has fairly dense non-zero blocks. We use the angle-based graph compression algorithm proposed by Saad and described earlier to compute exact or approximate block structures in A.
Step 2. Scale the matrix at Step 1 in the form S 1 P B AP T B S 2 using two diagonal matrices S 1 and S 2 , so that the 1-norm of the largest entry in each row and column is smaller or equal than 1.
Step 3. Find the block independent sets ordering P I of the quotient graph G=B ¼ V B ; E B f g . Apply the permutation to the matrix obtained at Step 2 as We use a simple form of weighted greedy algorithm for computing the ordering P I . The algorithm is the same as the one used in ARMS, and described in [25]. It consists of traversing the vertices G=B in the natural order 1, 2, …, n, marking each visited vertex v and all of its nearest neighbors connected to v by an edge and adding v and each visited node that is not already marked to the independent set. We assign the weight ∥Y∥ F to each supervertex Y.
In the 2 Â 2 partitioning (30), the upper left-most matrix D is block diagonal like in ARMS. However, due to the block permutation, the diagonal blocks of D are additionally block sparse matrices, as opposed to simply sparse matrices in ARMS and in other forms of multilevel incomplete LU factorizations, see [51,52]. The matrices F, E, C are also block sparse because of the same reason.
Step 4. Factorize the matrix (30) in the form where I is the identity matrix of appropriate size, and form the reduced system with the Schur complement The Schur complement is also block sparse and has the same block partitioning of C.
Steps 2-4 can be repeated on the reduced system a few times until the Schur complement is small enough. After one additional level, we obtain ð33Þ that can be factored as Denote as A ℓ the reduced Schur complement matrix at level ℓ, for ℓ > 1. After scaling and preordering A ℓ , a system with the matrix needs to be solved, with Calling the unknown solution vector and the right-hand side vector of system (35), the solution process with the above multilevel VBARMS factorization consists of level-by-level forward elimination followed by an exact solution on the last reduced system and suitable inverse permutation. The solving phase is sketched in Algorithm 1.
In VBARMS, we perform the factorization approximately, for memory efficiency. We use block ILU factorization with threshold to invert inexactly both the upper leftmost matrix D ℓ ≈ L ℓ U ℓ at each level ℓ, and the last level Schur complement matrix A ℓmax ≈ L S U S . The block ILU method used in VBARMS is a straightforward block variant of the one-level pointwise ILUT algorithm.
We drop small blocks B ∈ R mBÂnB in L ℓ , U ℓ , L S , U S whenever ∥B∥ F mBÁnB < t, for a given user-defined threshold t. The block pivots in block ILU are inverted exactly by using GE with partial pivoting. In assembling the Schur complement matrix A ℓþ1 at level ℓ, we take advantage of the finest block structure of D ℓ , F ℓ , E ℓ , C ℓ , imposed by the block ordering P B on the small (usually dense) blocks in the diagonal blocks of D ℓ and the corresponding small off-diagonal blocks in E ℓ and F ℓ ; we call optimized level-3 BLAS routines [53] for computing A ℓþ1 in Eq. (36). We do not drop entries in the Schur complement, except at the last level. The same threshold is applied in all these operations.
The VBARMS code is developed in the C language and is adapted from the existing ARMS code available in the ITSOL package [54]. The compressed sparse storage format of ARMS is modified to store block vectors and block matrices of variable size as a collection of contiguous non-zero dense blocks (we refer to this data storage format as VBCSR). First, we compute the factors L ℓ , U ℓ and L À1 ℓ F ℓ by performing a variant of the IKJ version of the Gaussian Elimination algorithm, where index I runs from 2 to m ℓ , index K from 1 to I À 1 ð Þand index J from K þ 1 ð Þ to n ℓ . This loop applies implicitly L À1 ℓ to the block row D ℓ ; F ℓ ½ to produce U ℓ ; L À1 ℓ F ℓ h i . In the second loop, Gaussian Elimination is performed on the block row E ℓ ; C ℓ ½ using the multipliers computed in the first loop to give E ℓ U À1 ℓ and an approximation of the Schur complement A ℓþ1 . Then, after Step 1, we permute explicitly the matrix at the first level as well as the matrices involved in the factorization at each new reordering step. For extensive performance assessment results of the VBARMS method, we point the reader to [50].

Numerical experiments
In this section, we illustrate the performance of the VBARMS method for solving a suite of block structured linear systems arising from an implicit Newton-Krylov formulation of the RANS equations in the turbulent incompressible flow analysis past a three-dimensional wing. On multicore machines, the quotient graph G=B is split into distinct subdomains, and each of them is assigned to a different core. Following the parallel framework described in [55], we separate the nodes assigned to the ith subdomain into interior nodes, that are those coupled by the equations only with the local variables, and interface nodes, those that may be coupled with the local variables stored on processor i as well as with remote variables stored on other processors (see Figure below).

local variables local interface variables external interface variables
The vector of the local unknowns x i and the local right-hand side b i are split accordingly in two separate components: the subvector corresponding to the internal nodes followed by the subvector of the local interface variables The rows of A indexed by the nodes of the ith subdomain are assigned to the ith processor.
These are naturally separated into a local matrix A i acting on the local variables x i ¼ u i ; y i À Á T , and an interface matrix U i acting on the remotely stored subvectors of the external interface variables y i, ext . Hence, we can write the local equations on processor i as 39) or, in expanded form, as where N i is the set of subdomains that are neighbors to subdomain i and the submatrix E ij y j accounts for the contribution to the local equation from the jth neighboring subdomain. Note that matrices B i , C i , E i, and F i still preserve the fine block structure imposed by the block ordering P B . From a code viewpoint, the quotient graph is initially distributed amongst the available processors; then, the built-in parallel hypergraph partitioner available in the Zoltan package [56] is applied on the distributed data structure to compute an optimal partitioning of the quotient graph that can minimize the amount of communications.
At this stage, the VBARMS method described in Section 4.2 can be used as a local solver for different types of global preconditioners. In the simplest parallel implementation, the socalled block-Jacobi preconditioner, the sequential VBARMS method can be applied to invert approximately each local matrix A i . The standard Jacobi iteration for solving Ax ¼ b is defined as where D is the diagonal of A, N ¼ D À A and x 0 is some initial approximation. In cases we have a graph partitioned matrix, the matrix D is block diagonal and the diagonal blocks of D are the local matrices A i . The interest to consider the block Jacobi preconditioner is its inherent parallelism, since the solves with the matrices A i are performed independently on all the processors and no communication is required.
If the diagonal blocks of the matrix D are enlarged in the block-Jacobi method so that they overlap slightly, the resulting preconditioner is called Schwarz preconditioner. Consider again a graph partitioned matrix with N nonoverlapping sets W 0 We define a δ-overlap partition and δ > 0 is the level of overlap with the neighboring domains. For each subdomain, we define a restriction operator R δ i , which is an n Â n matrix with the j; j ð Þth element equal to 1 if j ∈ W δ i , and zero elsewhere. We then denote The global preconditioning matrix M RAS is defined as and named as the Restricted Additive Schwarz (RAS) preconditioner [3,57]. Note that the preconditioning step still offers a good scope par parallelism, as the different components of the error update are formed independently. However, due to overlapping some communication is required in the final update, as the components are added up from each subdomain. In our experiments, the overlap used for RAS was the level 1 neighbors of the local nodes in the quotient graph.
A third global preconditioner that we consider in this study is based on the Schur complement approach. In Eq. (40), we can eliminate the vector of interior unknowns u i from the first equations to compute the local Schur complement system where S i denotes the local Schur complement matrix The local Schur complement equations considered altogether write as the global Schur complement system where the off-diagonal matrices E ij are available from the parallel distribution of the linear system. One preconditioning step with the Schur complement preconditioner consists in solving approximately the global system (47), and then recovering the u i variables from the local equations as at the cost of one local solve. We solve the global system (47) by running a few steps of the GMRES method preconditioned by a block diagonal matrix, where the diagonal blocks are the local Schur complements S i . The factorization is obtained as by-product of the LU factorization of the local matrix A i , which is by the way required to compute the u i variables in Eq. (48).

Results
The parallel experiments were run on the large-memory nodes (32 cores/node and 1 TB of memory) of the TACC Stampede system located at the University of Texas at Austin. TACC Stampede is a 10 PFLOPS (PF) Dell Linux Cluster based on 6400+ Dell PowerEdge server nodes, each outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi Coprocessor (MIC Architecture). We linked the default vendor BLAS library, which is MKL. Although MKL is multi-threaded by default, in our runs we used it in a single-thread mode since our MPI-based parallelisation employed one MPI process per core (communicating via the shared memory for the same-node cores). We used the Flexible GMRES (FGMRES) method [58] as Krylov subspace method, a tolerance of 1:0e À 6 in the stopping criterion and a maximum number of iteration equal to 1000. Memory costs were calculated as the ratio between the sum of the number of non-zeros in the local preconditioners and the sum of the number of non-zeros in the local matrices A i .
In our experiments, we analyzed the turbulent incompressible flow past a three-dimensional wing illustrated in Figure 2 using the EulFS code developed by the second author [59]. The geometry, called DPW3 Wing-1, was proposed in the 3rd AIAA Drag Prediction Workshop [35]. Flow conditions are 0.5 ∘ angle of attack and Reynolds number based on the reference chord equal to 5 Á 10 6 . The freestream turbulent viscosity is set to 10% of its laminar value. In  Table 3. Strong scalability study on the RANS2 problem using parallel graph partitioning.
Tables 1 and 2 we show experiments with the parallel VBARMS solver on the five meshes of the DPW3 Wing-1 problem. On the largest mesh we report on only one experiment, in Table 2, as this is a resource demanding problem. In Table 3 we report on a strong scalability study on the problem denoted as RANS2 by increasing the number of processors. Finally, in Table 4 we show comparative results with parallel VBARMS against other popular solvers; the method denoted as pARMS is the solver described in [55] using default parameters while the method VBILUT is a variable-block incomplete lower-upper factorization with threshold from the ITSOL package [54]. The results of our experiments show that the proposed preconditioner is effective to reduce the number of iterations especially in combination with the Restricted Additive Schwarz method, and exhibits good parallel scalability. A truly parallel implementation of the VBARMS method that may offer better numerical scalability will be considered as the next step of this research.

Conclusions
The applicability of Newton's method in steady flow simulations is often limited by the difficulty to compute a good initial solution, namely, one lying in a reasonably small neighborhood of the sought solution. This problem can now be overcome by introducing some approximations in the first stages of the solution procedure. In the case of unsteady flow problems, on the other hand, the use of Newton's method in conjunction with a dual-time stepping procedure is even more effective since the flow field computed at the preceding physical time level is likely to be sufficiently close to the sought solution at the next time level to allow the use of Newton's algorithm right from the beginning of the sub-iterations in pseudo-time. On the downside of Newton-Krylov methods is the need for efficiently preconditioned iterative algorithms to solve the sparse linear system arising at each inner iteration (Newton step). The stiffness of the linear systems to be solved increases when the Jacobian is computed "exactly" and the turbulence transport equations are solved fully coupled with the mean flow equations.
In this chapter, we have presented a block multilevel incomplete factorization preconditioner for solving sparse systems of linear equations arising from the implicit RANS formulation. The method detects automatically any existing block structure in the matrix, without any user's prior  Table 4. Experiments on the DPW3 Wing-1 problem. The RANS3 test case is solved on 32 processors and the RANS4 problem on 128 processors. The dash symbol À in the table means that in the GMRES iteration the residual norm is very large and the program is aborted.
knowledge of the underlying problem, and exploits it to maximize computational efficiency. The results of this chapter show that, by taking advantage of this block structure, the solver can be more robust and efficient. Other recent studies on block ILU preconditioners have drawn similar conclusions on the importance of exposing dense blocks during the construction of the incomplete LU factorization for better performance, in the design of incomplete multifrontal LUfactorization preconditioners [61] and adaptive blocking approaches for blocked incomplete Cholesky factorization [62]. We believe that the proposed VBARMS method can be useful for solving linear systems also in other areas, such as in Electromagnetics applications [63][64][65].