
Multilevel Variable-Block Schur-Complement-Based Preconditioning for the Implicit Solution of the Reynolds-Averaged Navier-Stokes Equations Using Unstructured Grids

By Bruno Carpentieri and Aldo Bonfiglioli

Submitted: April 25th 2017. Reviewed: October 27th 2017. Published: February 14th 2018.

DOI: 10.5772/intechopen.72043


Abstract

Implicit methods based on Newton's rootfinding algorithm are receiving increasing attention for the solution of complex Computational Fluid Dynamics (CFD) applications due to their potential to converge in a very small number of iterations. This approach requires fast convergence acceleration techniques in order to compete, in terms of CPU time, with other conventional solvers, such as those based on artificial dissipation or upwind schemes. In this chapter, we describe a multilevel variable-block Schur-complement-based preconditioner for the implicit solution of the Reynolds-averaged Navier-Stokes equations using unstructured grids on distributed-memory parallel computers. The proposed solver detects automatically exact or approximate dense structures in the linear system arising from the discretization, and exploits this information to enhance the robustness and improve the scalability of the block factorization. A complete study of the numerical and parallel performance of the solver is presented for the analysis of the turbulent Navier-Stokes equations on a suite of three-dimensional test cases.

Keywords

  • computational fluid dynamics
  • Reynolds-averaged Navier-Stokes equations
  • Newton-Krylov methods
  • linear systems
  • sparse matrices
  • algebraic preconditioners
  • incomplete LU factorization
  • multilevel methods

1. Introduction

A considerable number of modern high-fidelity Computational Fluid Dynamics (CFD) solvers and codes still adopt either one-dimensional physical models based on the Riemann problem using higher order shape functions, such as higher order Finite Volume (FV) and Discontinuous Galerkin Finite Element (FE) methods, for the discrete data representation, or truly multi-dimensional physical models using linear shape functions, like Fluctuation Splitting (FS) schemes. Both of these approaches require fast convergence acceleration techniques in order to compete, in terms of CPU time, with conventional solvers based on artificial dissipation or upwind schemes. Implicit methods based on Newton's rootfinding algorithm are receiving increasing attention in this context for the solution of complex real-world CFD applications, for example in the analysis of turbulent flows past three-dimensional wings, due to their potential to converge in a very small number of iterations [1, 2]. In this chapter, we consider convergence acceleration strategies for the implicit solution of the Reynolds-averaged Navier-Stokes (RANS) equations based on the FS space discretization, using a preconditioned Newton-Krylov algorithm for the integration. The use of a Newton solver requires the inversion of a large nonsymmetric system of equations at each step of the non-linear solution process. The choice of linear solver and preconditioner is crucial for efficiency, especially when the mean flow and the turbulence transport equation are solved in fully coupled form. In this study, we use the restarted Generalized Minimal Residual (GMRES) [3] algorithm as the inner linear solver, preconditioned by a block multilevel incomplete lower-upper (LU) factorization. We present the development of a multilevel preconditioning strategy that is effective in reducing the number of iterations of Krylov subspace methods at moderate memory cost, and that shows good parallel performance on three-dimensional turbulent flow simulations.

The chapter is structured as follows. The governing conservation equations for both compressible and incompressible flows are reviewed in Section 2. Section 3 briefly describes the fluctuation splitting space discretization, the time discretization and the Newton-Krylov method used to solve the space- and time-discretized set of governing partial differential equations (PDEs). In Section 4, we present the development of the multilevel preconditioning strategies for the inner linear solver. We illustrate the numerical and parallel performance of the preconditioner for the analysis of turbulent incompressible flows past a three-dimensional wing in Section 5. Some concluding remarks arising from the study are presented in Section 6.

2. Governing equations

In the case of inviscid and laminar flows, given a control volume C_i, fixed in space and bounded by the control surface ∂C_i with inward normal n, the governing equations of fluid dynamics are obtained by considering the conservation of mass, momentum and energy. In the case of viscous turbulent flows, one approach to account for the effects of turbulence is to average the unsteady Navier-Stokes (NS) equations on the turbulence time scale. Such an averaging procedure results in a new set of steady equations (the RANS equations) that differ from the steady NS equations by the presence of the Reynolds stress tensor, representing the effects of turbulence on the averaged flow field. The appearance of this tensor yields a closure problem, which is often solved by adopting an algebraic or a differential turbulence model. In the present work, we use the Spalart-Allmaras [4] one-equation model for the turbulent viscosity. Thus the integral form of the conservation law of mass, momentum, energy and turbulence transport equations has the form

\int_{C_i} \frac{\partial U}{\partial t}\, dV = \oint_{\partial C_i} \mathbf{n}\cdot F\, dS - \oint_{\partial C_i} \mathbf{n}\cdot G\, dS + \int_{C_i} S\, dV \qquad (1)

where U is the vector of conserved variables. For compressible flows, we have U = (ρ, ρe_0, ρu, ν̃)^T, and for incompressible, constant density flows, U = (p, u, ν̃)^T. The operators F and G represent the inviscid and viscous fluxes, respectively; for compressible flows, we have

F = \begin{pmatrix} \rho\mathbf{u} \\ \rho\mathbf{u}\,h_0 \\ \rho\mathbf{u}\mathbf{u} + p I \\ \tilde{\nu}\,\mathbf{u} \end{pmatrix}, \qquad G = \frac{1}{Re} \begin{pmatrix} 0 \\ \mathbf{u}\cdot\bar{\bar{\tau}} + \mathbf{q} \\ \bar{\bar{\tau}} \\ \frac{1}{Pr_T}\left(\nu + \tilde{\nu}\right)\nabla\tilde{\nu} \end{pmatrix} \qquad (2)

and for incompressible, constant density flows,

F = \begin{pmatrix} a^2\mathbf{u} \\ \mathbf{u}\mathbf{u} + p I \\ \tilde{\nu}\,\mathbf{u} \end{pmatrix}, \qquad G = \frac{1}{Re} \begin{pmatrix} 0 \\ \bar{\bar{\tau}} \\ \frac{1}{Pr_T}\left(\nu + \tilde{\nu}\right)\nabla\tilde{\nu} \end{pmatrix} \qquad (3)

Finally, the source term S has a non-zero entry only in the row corresponding to the turbulence transport equation; its expression is not reported here for brevity, but can be found in [4]. Note that the standard NS equations are retrieved from (1) by removing the source term S and the differential equation associated with the turbulence variable, and setting the effective viscosity and thermal conductivity to their laminar values. The Euler equations are instead recovered by additionally removing the flux vector G.

3. Solution techniques

The model used in this study for the discrete data representation is based on the coupling of a hybrid class of methods for the space discretization, called Fluctuation Splitting (or residual distribution) schemes [5], and a fully coupled Newton algorithm. By "fully coupled" we mean that the mass, momentum and energy conservation equations on one hand, and the turbulent equation on the other, are solved simultaneously rather than in a decoupled or staggered fashion. We discuss in the following subsections, separately, the space and time discretization, the numerical integration of the set of equations resulting from the discretization, and the solution of the large linear system at each Newton step.

3.1. Space discretisation

The Fluctuation Splitting approach has features common to both Finite Element (FE) and Finite Volume (FV) methods. Like in standard FE methods, the dependent variables are stored at the vertices of the computational mesh, made up of triangles in the two-dimensional (2D) space and tetrahedra in the three-dimensional (3D) space, and are assumed to vary linearly and continuously in space. Denoting Z_i as the nodal value of the dependent variable at the grid point i and N_i as the FE linear shape function, this dependence can be written as

Z(\mathbf{x}, t) = \sum_i Z_i(t)\, N_i(\mathbf{x}) \qquad (4)

Note that, although the summation in Eq. (4) extends over all grid nodes, the computational molecule of each node is actually limited only to the set of its nearest neighbors due to the compact support of the linear shape functions. In the compressible case, Roe’s parameter vector

Z = \left( \sqrt{\rho},\; \sqrt{\rho}\,h_0,\; \sqrt{\rho}\,\mathbf{u},\; \tilde{\nu} \right)^T \qquad (5)

is chosen as the dependent variable to ensure discrete conservation [6]. In the incompressible case, discrete conservation is obtained by simply setting the dependent variable Z equal to the vector of conserved variables U. In our code, we group the dependent variables per gridpoint. The first m entries of the array Z are filled with the m flow variables of gridpoint 1, and these are followed by those of gridpoint 2, and so on. Blocking the flow variables in this way, also referred to as "field interlacing" in the literature, is acknowledged [7, 8, 9] to result in better performance than grouping variables per aerodynamic quantity.

The integral Eq. (1) is discretized over each control volume C_i using a FV-type approach. In two dimensions, the control volumes C_i are drawn around each gridpoint by joining the centroids of gravity of the surrounding cells with the midpoints of all the edges that connect that gridpoint with its nearest neighbors. An example of polygonal-shaped control volumes (so-called median dual cells) is shown by green lines in Figure 1(a). With FS schemes, rather than calculating the inviscid fluxes by numerical quadrature along the boundary ∂C_i of the median dual cell, as would be done with conventional FV schemes, the net inviscid flux Φ^{e,inv} over each triangular/tetrahedral element

\Phi^{e,inv} = \oint_{\partial T^e} \mathbf{n}\cdot F\, dS \qquad (6)

is evaluated by means of a conservative linearization based on the parameter vector [6], and scattered to the element vertices using elemental distribution matrices B_i^e [5]. The inviscid contribution to the nodal residual R_{Φ,i} is then assembled by collecting the fractions Φ_i^{e,inv} of the net inviscid fluxes Φ^{e,inv} associated with all the elements by which the node i is surrounded. This is schematically shown in Figure 1(b). Concerning the viscous terms, the corresponding flux balance is evaluated by surface integration along the boundaries of the median dual cell: node i receives a contribution Φ_i^{e,vis} from cell e, which accounts for the viscous flux through the portion of ∂C_i that belongs to that cell. This approach can be shown to be equivalent to a Galerkin FE discretization.

Figure 1.

Residual distribution concept. (a) The flux balance of cell T is scattered among its vertices. (b) Gridpoint i gathers the fractions of cell residuals from the surrounding cells.

Summing up the inviscid and viscous contributions to the nodal residual of gridpoint i one obtains

R_{\Phi,i} = \sum_{e \ni i} \left( \Phi_i^{e,inv} + \Phi_i^{e,vis} \right) = \sum_{e \ni i} \left( B_i^e\, \Phi^{e,inv} + \Phi_i^{e,vis} \right) \qquad (7)

In Eq. (7), the summation ranges over all the elements e that meet in meshpoint i, as shown in Figure 1(b). The construction of the distribution matrices B_i^e involves the solution of d+1 small (of order m) dense linear systems for each triangular/tetrahedral element of the mesh, thus making FS schemes somewhat more expensive than state-of-the-art FV schemes based upon either central differencing with artificial dissipation or upwind discretizations. The relatively high computational cost of FS discretizations has to be accounted for when deciding whether the Jacobian matrix should be stored in memory or a Jacobian-free method be used instead.

3.2. Time discretisation

One route to achieve second-order time accuracy with FS schemes is to use mass matrices that couple the time derivatives of neighboring grid points. This leads to implicit schemes, even if the spatial residual were treated explicitly. Although a more general framework for the derivation of the mass matrices can be devised [10], the approach adopted in our study consists of formulating the FS scheme as a Petrov-Galerkin FE method with elemental weighting function given by

\Omega_i^e = \left( N_i - \frac{1}{d+1} \right) I_{m\times m} + B_i^e \qquad (8)

The contribution of element e to the weighted residual equation for grid point i reads

\int_{T^e} \Omega_i^e\, \frac{\partial U}{\partial t}\, dV = \int_{T^e} \Omega_i^e\, \frac{\partial U}{\partial Z}\, \frac{\partial Z}{\partial t}\, dV, \qquad (9)

where the chain rule is used to make the dependent variable Z appear. Since the conservative variables U are quadratic functions of the parameter vector Z, the transformation matrix ∂U/∂Z is linear in Z and can thus be expanded using the linear shape functions N_j, just as in Eq. (4). A similar expansion applies to the time derivative ∂Z/∂t. Replacing both expansions in the RHS of Eq. (9), the discrete counterpart of the time derivative of Eq. (9) is given by the contribution of all elements sharing the node i:

\int_{C_i} \frac{\partial U}{\partial t}\, dV = \sum_e \int_{T^e} \Omega_i^e\, \frac{\partial U}{\partial t}\, dV = \sum_e \sum_{j \in e} M_{ij}^e \left( \frac{\partial Z}{\partial t} \right)_j. \qquad (10)

In Eq. (10) the index j spans the vertices of the element e and the nodal values (∂Z/∂t)_j are approximated by the three-level Finite Difference (FD) formula

\left( \frac{\partial Z}{\partial t} \right)_j = \frac{3 Z_j^{n+1} - 4 Z_j^n + Z_j^{n-1}}{2\Delta t}. \qquad (11)

The matrix M_{ij}^e in Eq. (10) is the contribution of element e to the entry in the ith row and jth column of the global mass matrix M. Similarly to what is done in the assembly of the inviscid and viscous flux balance, Eq. (7), the discretization of the unsteady term, Eq. (10), is obtained by collecting elemental contributions from all the elements that surround the node i. Second-order space and time accuracy of the scheme described above has been demonstrated for inviscid flow problems by Campobasso et al. [11] using an exact solution of the Euler equations.

3.3. Numerical integration

Writing down the space- and time-discretized form of Eq. (1) for all gridpoints of the mesh, one obtains the following large, sparse system of non-linear algebraic equations

R_g\!\left( U(Z) \right) = R_\Phi\!\left( U(Z) \right) - M\, \frac{1}{\Delta t} \left( \frac{3}{2} Z^{n+1} - 2 Z^n + \frac{1}{2} Z^{n-1} \right) = 0 \qquad (12)

to be solved at time level n+1 to obtain the unknown solution vector U^{n+1}. The solution of Eq. (12) is obtained by means of an implicit approach based on the use of a fictitious time derivative (Jameson's dual time-stepping [12]), which amounts to solving the following evolutionary problem

V_M\, \frac{dU}{d\tau} = R_g(U) \qquad (13)

in pseudo-time τ until steady state is reached. Since accuracy in pseudo-time is obviously irrelevant, the mass matrix has been lumped into the diagonal matrix V_M, and a first-order accurate, two-time-level FD formula

\frac{dU}{d\tau} \approx \frac{U^{n+1,k+1} - U^{n+1,k}}{\Delta\tau} \qquad (14)

is used to approximate the pseudo-time derivative in the LHS of Eq. (13). The outer iteration counter k has been introduced in Eq. (14) to label the pseudo-time levels.

Upon replacing Eq. (14) in Eq. (13), an implicit scheme is obtained if the residual R_g is evaluated at the unknown pseudo-time level k+1. Taylor expanding R_g about time level k, one obtains the following sparse system of linear equations

\left( \frac{1}{\Delta\tau^k} V_M - J \right) \Delta U = R_g\!\left( U^{n+1,k} \right), \qquad J = \frac{\partial R_g}{\partial U} \qquad (15)

to be solved at each outer iteration until the required convergence of R_g is obtained. Steady RANS simulations are accommodated within the presented integration scheme by dropping the physical time-derivative term in Eq. (12). In the limit Δτ^k → ∞, Eq. (15) recovers Newton's rootfinding algorithm, which is known to yield quadratic convergence when the initial guess U^{n+1,0} = U^n is sufficiently close to the sought solution U^{n+1}. This is likely to occur when dealing with unsteady flow problems, because the solution of the flow field at a given physical time starts from the converged solution at the preceding time, and the latter constitutes a very convenient initial state. In fact, it is sufficiently close to the sought new solution to allow the use of the exact Newton's method (i.e. Δτ^k = ∞ in Eq. (15)) from the very first solution step.

The situation is different when dealing with steady flow problems. Newton's method is only locally convergent, meaning that it is guaranteed to converge to a solution only when the initial approximation is already close enough to the sought solution. This is generally not the case when dealing with steady flows, and a "globalization strategy" needs to be used in order to avoid stall or divergence of the outer iterations. The choice commonly adopted by various authors [13, 14], and in this study as well, is a pseudo-transient continuation, which amounts to retaining the pseudo-transient term in Eq. (15). In the early stages of the iterative process, the pseudo-time step length Δτ^k in Eq. (15) is kept small. The advantage is twofold: on one hand, it helps prevent stall or divergence of the outer iterations; on the other hand, it makes the linear system (15) easier to solve by means of an iterative solver since, for moderate values of Δτ^k, the term V_M/Δτ^k increases the diagonal dominance of the matrix. Once the solution has come close to the steady state, which can be monitored by looking at the norm of the nodal residual R_Φ, we let Δτ^k grow unboundedly so that Newton's method is eventually recovered during the last steps of the iterative process. The time step length Δτ^k is selected according to the Switched Evolution Relaxation (SER) strategy proposed by Mulder and van Leer [15], as follows:

\Delta\tau^k = \Delta\tau\, \min\!\left( C_{max},\; C_0\, \frac{\left\| R_g\!\left( U^{n+1,0} \right) \right\|_2}{\left\| R_g\!\left( U^{n+1,k} \right) \right\|_2} \right), \qquad (16)

where Δτ is the pseudo-time step based upon the stability criterion of the explicit time integration scheme, and C_0 and C_max are user-defined constants controlling the initial and maximum pseudo-time steps used in the actual calculations.
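
The SER update of Eq. (16) is simple to implement; a minimal sketch in C is given below, assuming the residual norms have already been computed. The function and variable names are illustrative and are not taken from the actual flow solver.

```c
#include <math.h>

/* Switched Evolution Relaxation (SER) pseudo-time step, cf. Eq. (16).
 * dtau_expl : pseudo-time step from the explicit stability criterion
 * c0, cmax  : user-defined constants (initial and maximum growth factors)
 * r0, rk    : 2-norms of R_g at the first and current outer iteration
 * Illustrative sketch only; names do not refer to the actual code.        */
double ser_time_step(double dtau_expl, double c0, double cmax,
                     double r0, double rk)
{
    double growth = c0 * (r0 / rk);      /* grows as the residual drops    */
    if (growth > cmax) growth = cmax;    /* cap at the maximum value       */
    return dtau_expl * growth;
}
```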

In the early stages of the iterative process, the turbulent transport equation and the mean flow equations are solved in tandem (or in a loosely coupled manner, following the nomenclature used by Zingg et al. [16]): the mean flow solution is advanced over a single pseudo-time step using an analytically computed, but approximate, Jacobian while keeping the turbulent viscosity frozen; then the turbulent variable is advanced over one or more pseudo-time steps using a FD Jacobian with frozen mean flow variables. Due to the uncoupling between the mean flow and turbulent transport equations, this procedure will eventually converge to steady state, but never yields quadratic convergence. Close to steady state, when a true Newton strategy preceded by a "short" pseudo-transient continuation phase can be adopted, the mean flow and the turbulence transport equation are solved in fully coupled form, and the Jacobian is computed by FD. For the sake of completeness, we give further details of each of these two steps in the following two paragraphs.

3.3.1. Tandem solution strategy with (approximate) Picard linearization

Consider re-writing the steady nodal residual RΦ, see [17] for full details, as

R_\Phi(U) = \left( C - D \right) U, \qquad (17)

where C and D are (sparse) matrices that account for the convective and diffusive contributions to the nodal residual vector R_Φ. Matrix D is constant for isothermal, incompressible flows, whereas it depends upon the flow variables through the molecular viscosity in the case of compressible flows. Matrix C depends upon U for both compressible and incompressible flows. Both matrices can be computed analytically as described in [17]. What we refer to as a Picard linearization consists in the following approximation

J \approx C - D, \qquad (18)

which amounts to neglecting the dependence of the matrices C and D upon U when differentiating the residual, written as in Eq. (17).

Once the mean flow solution has been advanced over a single pseudo-time step using the approximate Picard linearization, keeping the turbulent viscosity frozen, the turbulent variable is advanced over one or more (typically ten) pseudo-time steps using a FD Jacobian approximation (described in Section 3.3.2) with frozen mean flow variables. Blanco and Zingg [18] adopt a similar strategy, but keep iterating the turbulence transport equation until its residual has become lower than that of the mean flow equations. The loosely coupled solution strategy is a choice often made as it "allows for the easy interchange of new turbulence models" [19] and also reduces the storage [18], compared to a fully coupled approach. However, due to the uncoupling between the mean flow and the turbulent transport equations, the tandem solution strategy never yields quadratic convergence, nor is it always able to drive the nodal residual to machine zero. The last statement cannot be generalized, since Blanco and Zingg [16, 20] report convergence to machine zero for their loosely coupled approach on two-dimensional unstructured grids. However, even if convergence to machine zero is difficult to achieve, the nodal residual is always sufficiently converged for any practical "engineering" purpose and close enough to the "true" steady-state solution to be a good initial guess for Newton's method.

3.3.2. Fully coupled solution strategy with FD Newton linearization

Once the tandem solution strategy has provided a good approximation to the steady flow, or when dealing with unsteady flows, in which case the solution at a given time level is generally a good approximation to the one sought at the next time level, it becomes very attractive to take advantage of the quadratic convergence of Newton's method. In order to do so, however, the mean flow and the turbulence transport equations must be solved fully coupled and the Jacobian matrix J must be accurate. We take advantage of the compactness of the computational stencil required by FS schemes to compute a close approximation to the true Jacobian matrix even for second-order accurate discretizations. The analytical evaluation of the Jacobian matrix, though not impossible [5, 21], is rather cumbersome and thus this approach is not pursued here.

When the equations are fully coupled, the structure of the Jacobian matrix J is naturally organized into small dense blocks of order m. This has implications both in terms of storage, since it is possible to reduce the length of the integer pointers that define the Compressed Sparse Row (CSR) data structure of the sparse matrix, and also in the design of the preconditioner for solving the large linear system at each Newton step, where division operations can be efficiently replaced by block matrix factorizations. We will address these issues in detail in the next section. Two neighboring gridpoints, i and j, in the mesh will contribute two block entries, J_{ij} and J_{ji}, to the global Jacobian matrix J. Each of these two block entries (say J_{ij}, for instance) will be computed by assembling elemental contributions coming from all the cells that share vertex i, as follows

J_{ij} = \sum_{e \ni i} J_{ij}^e. \qquad (19)

Eq. (19) follows by applying the sum rule of differentiation and by observing that the nodal residual itself is a sum of contributions from the elements that share vertex i, see Figure 1(b). Specifically, the element J_{ij}^e accounts for the contribution of cell e to the residual change at gridpoint i, due to a change in the state vector of a neighboring gridpoint j that belongs to the same element e. The contribution of cell e to the entry (p,q) of the block J_{ij} is computed from the following one-sided FD formula

\left( J_{ij}^e \right)_{p,q} = \frac{ R_{g,i}^{e,p}\!\left( U_i, \hat{U}_j^q \right) - R_{g,i}^{e,p}\!\left( U_i, U_j \right) }{ \varepsilon }, \qquad 1 \le p, q \le m; \quad i, j \in e, \qquad (20)

where R_{g,i}^{e,p} is the pth component of the contribution of cell e to the nodal residual of gridpoint i. In Eq. (20) we have emphasized that R_{g,i}^e only depends upon the flow state of the d+1 vertices of cell e, which include both i and j. The first term in the numerator, R_{g,i}^{e,p}(U_i, Û_j^q), is computed by perturbing the qth component of the conserved variables vector at gridpoint j as follows

\hat{U}_j^q = \left( u_j^1,\; u_j^2,\; \ldots,\; u_j^q + \varepsilon(u_j^q),\; \ldots,\; u_j^m \right)^T, \qquad (21)

where ε is a "small" quantity. Due to the use of a one-sided FD formula, the FD approximation (20) of the Jacobian entry is affected by a truncation error which is proportional to the first power of ε. Small values of ε keep the truncation error small, but too small values may lead to round-off errors. Following [21], ε is computed as

\varepsilon(x) = \sqrt{\varepsilon_{mc}}\, \max\!\left( |x|, 1 \right) \mathrm{sgn}(x). \qquad (22)

From a coding viewpoint, the same loop over all cells used to build the nodal residual R_g is also used to assemble the matrix J. The operations to be performed within each cell are the following: i) perturb each of the m components of the conserved variables vector of the d+1 vertices of cell e; ii) evaluate the residual contribution to each of the vertices; iii) calculate the Jacobian entries according to Eq. (20). Each cell e thus contributes (d+1)^2 block entries to the global matrix J. Moreover, it follows that the cost of a Jacobian evaluation is equal to m×(d+1) residual evaluations, which can be quite a large number. For instance, for a 3D compressible RANS calculation using a one-equation turbulence model, m×(d+1) = 24. In this study, it was decided to store the Jacobian matrix in memory rather than using a Jacobian-free Newton-Krylov (JFNK) approach, as the Jacobian matrix is relatively sparse even for a second-order accurate discretization due to the compactness of the stencil. The JFNK approach avoids assembling and, more importantly, storing the Jacobian matrix. However, the matrix-vector product with the Jacobian matrix is replaced by FD formulae, which requires extra costly FS residual evaluations. Note that the JFNK method still requires the construction of a preconditioner to be used by the iterative linear solver, often an Incomplete Lower Upper factorization, which is typically constructed using a lower order approximation of the residual vector. Matrix-free preconditioners might also be used [22], saving a huge amount of storage at the expense of extra CPU cost; these are referred to as MFNK methods. Although JFNK or MFNK approaches should certainly be favored from the viewpoint of memory occupation, it cannot always be "assumed that the Jacobian-free matrix-vector products are inherently advantageous in terms of computing time" [13].
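
As an illustration of the one-sided FD formula (20) and of the perturbation size (22), the fragment below sketches how a single block entry could be filled. Here residual_cell() is a hypothetical routine, assumed to be supplied elsewhere, that returns the contribution of one cell to the nodal residual of a given vertex; all names and the data layout are illustrative rather than taken from the actual code.

```c
#include <math.h>
#include <float.h>

#define M 6          /* dof per gridpoint (3D compressible RANS, 1-eq model) */
#define NVERT 4      /* d+1 vertices of a tetrahedron                        */

/* Hypothetical routine: contribution of one cell to the nodal residual of
 * vertex iv, given the states U[NVERT][M] of the cell vertices.            */
void residual_cell(double U[NVERT][M], int iv, double R[M]);

/* Perturbation size, cf. Eq. (22); DBL_EPSILON plays the role of eps_mc.   */
static double perturb(double x)
{
    double s = (x >= 0.0) ? 1.0 : -1.0;
    return sqrt(DBL_EPSILON) * fmax(fabs(x), 1.0) * s;
}

/* One-sided FD approximation of the block J_ij^e, cf. Eq. (20): the
 * sensitivity of the residual at vertex iv w.r.t. the state of vertex jv.  */
void fd_jacobian_block(double U[NVERT][M], int iv, int jv, double J[M][M])
{
    double R0[M], R1[M];
    residual_cell(U, iv, R0);             /* unperturbed residual            */

    for (int q = 0; q < M; q++) {
        double eps   = perturb(U[jv][q]);
        double saved = U[jv][q];
        U[jv][q] += eps;                  /* perturb the q-th variable       */
        residual_cell(U, iv, R1);
        U[jv][q] = saved;                 /* restore the state               */
        for (int p = 0; p < M; p++)
            J[p][q] = (R1[p] - R0[p]) / eps;
    }
}
```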

The compactness of the FS stencil, which never extends beyond the set of distance-1 neighbors even for a second-order-accurate space-time discretization, offers two advantages. On one hand, apart from the truncation and round-off errors involved in the FD derivatives, the numerical Jacobian matrix is a close approximation of the analytical Jacobian, even for a second-order-accurate discretization. This feature is crucial for retaining the quadratic convergence properties of Newton's algorithm. On the other hand, it is considerably sparser than that obtained using more traditional FV discretizations, which typically extend up to distance-2 [1] or even distance-3 neighbors [8]. In these latter cases, contributions from the outermost gridpoints in the stencil have to be neglected [1], or at least lumped [8], when constructing the Jacobian approximation upon which the ILU(ℓ) preconditioner is built. These approximations are a potential source of performance degradation, as reported in [8]. The memory occupation required to store the Jacobian matrix still remains remarkable. Moreover, not only the Jacobian matrix, but also its preconditioner needs to be stored in the computer memory. It is therefore clear that a key ingredient that would help reduce memory occupation is an effective preconditioner having a relatively small number of non-zero entries, as close as possible to that of the Jacobian matrix. This demanding problem is addressed in the next section.

4. Linear solve and preconditioning

The previous discussion has pointed out that the solution of the large nonsymmetric sparse linear system (15) at each pseudo-time step is a major computational task of the whole flow simulation, especially when the mean flow and the turbulence transport equations are solved in fully coupled form, the Jacobian is computed exactly by means of FD, and the size of the time step is rapidly increased to recover Newton's algorithm. For convenience, we write system (15) in compact form as

A x = b, \qquad (23)

where A = [a_{ij}] is the large and sparse coefficient matrix of, say, size n, and b is the right-hand side vector. It is well established that, when A is highly nonsymmetric and/or indefinite, iterative methods need the assistance of preconditioning to transform system (23) into an equivalent one that is more amenable to an iterative solver. The transformed preconditioned system writes in the form M^{-1}Ax = M^{-1}b when preconditioning is applied from the left, or AM^{-1}y = b with x = M^{-1}y when preconditioning is applied from the right. The matrix M is a nonsingular approximation to A called the preconditioner matrix. In the coming sections, we describe the development of an effective algebraic preconditioner for the RANS model.

4.1. Multi-elimination ILU factorization preconditioner

Incomplete LU factorization methods (ILUs) are an effective, yet simple, class of preconditioning techniques for solving large linear systems. They write in the form M = L̄Ū, where L̄ and Ū are approximations of the L and U factors of the standard triangular LU decomposition of A. The incomplete factorization may be computed directly from the Gaussian Elimination (GE) algorithm, by discarding some entries in the L and U factors according to various strategies, see [3]. A stable ILU factorization is proved to exist for arbitrary choices of the sparsity pattern of L̄ and Ū only for particular classes of matrices, such as M-matrices [23] and H-matrices with positive diagonal entries [24]. However, many techniques can help improve the quality of the preconditioner on more general problems, such as reordering, scaling, diagonal shifting, pivoting and condition estimators [25, 26, 27, 28]. As a result of these developments, in the past decade successful experiences have been reported using ILU preconditioners in areas that were the exclusive domain of direct solution methods, such as circuit simulation, power system networks, chemical engineering plant modeling, graphs and other problems not governed by PDEs, or in areas where direct methods have been traditionally preferred, such as structural analysis, semiconductor device modeling and computational fluid dynamics (see [29, 30, 31, 32, 33]).

Multi-elimination ILU factorization is a powerful class of ILU preconditioners, which combines the simplicity of ILU techniques with the robustness and high degree of parallelism of domain decomposition methods [34]. It is developed on the idea that, due to sparsity, many unknowns of a linear system are not coupled by an equation (i.e. they are independent) and thus they can be eliminated simultaneously at a given stage of GE. If the, say, m independent unknowns are numbered first, and the other n−m unknowns last, the coefficient matrix of the system is permuted into a 2×2 block structure of the form

P A P^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix}, \qquad (24)

where D is a diagonal matrix of dimension m and C is a square matrix of dimension n−m. In multi-elimination methods, a reduced system is recursively constructed from (24) by computing a block LU factorization of PAP^T of the form

\begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ G & I_{n-m} \end{pmatrix} \times \begin{pmatrix} U & W \\ 0 & A_1 \end{pmatrix}, \qquad (25)

where L and U are the triangular factors of the LU factorization of D, A_1 = C − E D^{-1} F is the Schur complement with respect to C, I_{n−m} is the identity matrix of dimension n−m, and we denote G = EU^{-1} and W = L^{-1}F. The reduction process can be applied another time to the reduced system with A_1, and recursively to each consecutively reduced system, until the Schur complement is small enough to be solved with a standard method such as a dense LAPACK solver [35]. Multi-elimination ILU factorization preconditioners may be obtained from the decomposition (25) by performing the reduction process inexactly, dropping small entries in the Schur complement matrix and/or factorizing D approximately at each reduction step. These preconditioners exhibit better parallelism than conventional ILU algorithms, due to the recursive factorization. Additionally, for comparable memory usage, they may be significantly more robust, especially for solving large problems, as the reduced system is typically small and better conditioned compared to the full system.

The factorization (25) defines a general framework which may accommodate many different methods. An important distinction between various methods is rooted in the choice of the algorithm used to discover sets of independent unknowns. Many of these algorithms are borrowed from graph theory, where such sets are referred to as independent sets. Denoting as G = (V, E) the adjacency graph of A, where V = {v_1, v_2, …, v_n} is the set of vertices and E the set of edges, a vertex independent set S is defined as a subset of V such that

\forall\, v_i \in S,\; \forall\, v_j \in S: \quad (v_i, v_j) \notin E. \qquad (26)

The set S is maximal if there is no other independent set containing S strictly [36]. Independent sets in a graph may be computed by simple greedy algorithms which traverse the vertices in the natural order 1, 2, …, n: each visited vertex v that is not already marked is added to the independent set, and then v and all of its nearest neighbors connected to v by an edge are marked [37]. As an alternative to the greedy algorithm, the nested dissection ordering [38], mesh partitioning, or further information from the set of nested finite element grids of the underlying problem can be used [39, 40, 41].
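
A minimal sketch of the greedy independent set search on a CSR adjacency graph is given below; it assumes a symmetric pattern (e.g. the graph of A + A^T) and uses illustrative names only, rather than the actual ARMS/VBARMS routines.

```c
#include <stdlib.h>

/* Greedy independent set on a graph stored in CSR form (symmetric pattern).
 * ia[0..n], ja[]: adjacency lists; indep[] receives the selected vertices.
 * Returns the number of vertices placed in the independent set.
 * Illustrative sketch; not the actual ARMS/VBARMS implementation.          */
int greedy_independent_set(int n, const int *ia, const int *ja, int *indep)
{
    int *marked = calloc(n, sizeof(int));
    int count = 0;

    for (int v = 0; v < n; v++) {          /* natural order 1, 2, ..., n     */
        if (marked[v]) continue;
        indep[count++] = v;                /* v joins the independent set    */
        marked[v] = 1;                     /* mark v ...                     */
        for (int k = ia[v]; k < ia[v + 1]; k++)
            marked[ja[k]] = 1;             /* ... and all of its neighbors   */
    }
    free(marked);
    return count;
}
```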

The multilevel preconditioner considered in our study is the Algebraic Recursive Multilevel Solver (ARMS) introduced in [25], which uses block independent sets computed by the simple greedy algorithm. Block independent sets are characterized by the property that unknowns of two different sets have no coupling, while unknowns within the same set may be coupled. In this case, the matrix D appearing in (24) is block diagonal, and may typically consist of large diagonal blocks that are factorized by an ILU factorization with threshold (ILUT [42]) for memory efficiency. In the ARMS implementation described in [25], first the incomplete triangular factors L̄, Ū of D are computed by one sweep of ILUT, and an approximation W̄ to L̄^{-1}F is also computed. In a second loop, an approximation Ḡ to EŪ^{-1} and an approximate Schur complement matrix Ā_1 are derived. This holds at each reduction level. At the last level, another sweep of ILUT is applied to the (last) reduced system. The blocks W̄ and Ḡ are stored temporarily, and then discarded from the data structure after the Schur complement matrix is computed. Only the incomplete factors of D at each level, those of the last-level Schur matrix, and the permutation arrays are needed for the solving phase. By this implementation, dropping can be performed separately in the matrices L̄, Ū, W̄, Ḡ, Ā_1. This in turn allows factoring D accurately without incurring additional costs in Ḡ and W̄, achieving high computational and memory efficiency. Implementation details and careful selection of the parameters are always critical aspects to consider in the design of sparse matrix algorithms. Next, we show how to combine the ARMS method with matrix compression techniques to exploit the block structure of A for better efficiency.

4.2. The variable-block ARMS factorization

The discretization of the Navier-Stokes equations for turbulent compressible flows assigns five distinct variables to each grid point (density, scaled energy, two components of the scaled velocity, and the turbulence transport variable); these reduce to four for incompressible, constant density flows, and to three if additionally the flow is laminar. If the, say m, distinct variables associated with the same node are numbered consecutively, the permuted matrix has a sparse block structure with non-zero blocks of size m×m. The blocks are usually fully dense, as variables at the same node are mutually coupled. Exploiting any available block structure in the preconditioner design may bring several benefits [43], some of which are explained below:

  1. Memory. A clear advantage is to store the matrix as a collection of blocks using the variable-block compressed sparse row (VBCSR) format, saving column indices and pointers for the block entries.

  2. Stability. On indefinite problems, computing with blocks instead of single elements enables a better control of pivot breakdowns, near singularities, and other possible sources of numerical instabilities. Block ILU solvers may be used instead of pointwise ILU methods.

  3. Complexity. Grouping variables in clusters, the Schur complement is smaller and hopefully the last reduced system is better conditioned and easier to solve.

  4. Efficiency. A full block implementation, based on higher level optimized BLAS as computational kernels, may be designed leading to better flops to memory ratios on modern cache-based computer architectures.

  5. Cache effects. Better cache reuse is possible for block algorithms.

It has been demonstrated that block iterative methods often exhibit a faster convergence rate than their pointwise analogues for the solution of many classes of two- and three-dimensional partial differential equations (PDEs) [44, 45, 46]. For this reason, in the case of the simple Poisson's equation with Dirichlet boundary conditions on a rectangle (0, l_1)×(0, l_2), discretized uniformly by using n_1+2 points in the interval (0, l_1) and n_2+2 points in (0, l_2), it is often convenient to number the interior points by lines from the bottom up in the natural ordering, so that one obtains an n_2×n_2 block tridiagonal matrix with square blocks of size n_1×n_1; the diagonal blocks are tridiagonal matrices and the off-diagonal blocks are diagonal matrices. For large finite element discretizations, it is common to use substructuring, where each substructure of the physical mesh corresponds to one sparse block of the system. If the domain is highly irregular or the matrix does not correspond to a differential equation, finding the best block partitioning is much less obvious. In this case, graph reordering techniques are worth considering.
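
For illustration, assuming the standard 5-point finite-difference stencil and equal mesh spacing in both directions (neither is specified in the text), the resulting coefficient matrix has the familiar form

A = \begin{pmatrix} T & -I & & \\ -I & T & \ddots & \\ & \ddots & \ddots & -I \\ & & -I & T \end{pmatrix}, \qquad T = \begin{pmatrix} 4 & -1 & & \\ -1 & 4 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 4 \end{pmatrix} \in \mathbb{R}^{n_1 \times n_1},

with n_2 diagonal blocks T along the block diagonal: the diagonal blocks are tridiagonal and the off-diagonal blocks are (negative) identity matrices, i.e. diagonal, in agreement with the block structure described above.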

The PArameterized BLock Ordering (PABLO) method proposed by O’Neil and Szyld is one of the first block reordering algorithms for sparse matrices [47]. The algorithm selects groups of nodes in the adjacency graph of the coefficient matrix such that the corresponding diagonal blocks are either full or very dense. It has been shown that classical block stationary iterative methods such as block Gauss-Seidel and SOR methods combined with the PABLO ordering require fewer operations than their point analogues for the finite element discretization of a Dirichlet problem on a graded L-shaped region, as well as on the 9-point discretization of the Laplacian operator on a square grid. The complexity of the PABLO algorithm is proportional to the number of nodes and edges in both time and space.

Another useful approach to compute dense blocks in the sparsity pattern of a matrix A is the method proposed by Ashcraft in [48]. The algorithm searches for sets of rows or columns having the exact same pattern. From a graph viewpoint, it looks for vertices of the adjacency graph (V, E) of A having the same adjacency list. These are also called indistinguishable nodes or cliques. The algorithm assigns a checksum quantity to each vertex, using the function

\mathrm{chk}(u) = \sum_{(u,w) \in E} w, \qquad (27)

and then sorts the vertices by their checksums. This operation takes O(|E| + |V| log |V|) time. If u and v are indistinguishable, then chk(u) = chk(v). Therefore, the algorithm examines nodes having the same checksum to see if they are indistinguishable. The ideal checksum function would assign a different value to each different row pattern that occurs, but it is not practical because it may quickly lead to huge numbers that may not even be machine-representable. Since the time cost required by Ashcraft's method is generally negligible relative to the time it takes to solve the system, simple checksum functions such as (27) are used in practice [48].
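
The sketch below illustrates the checksum idea of Eq. (27) on a CSR pattern: each row receives the sum of its column indices as checksum, and only rows with equal checksums are compared entry by entry. It assumes sorted column indices within each row, and for clarity uses a quadratic scan instead of the sort described above; names are illustrative and not taken from the code of [48].

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if rows r and s of the CSR pattern (ia, ja) are identical.
 * Column indices are assumed sorted within each row.                       */
static int same_pattern(const int *ia, const int *ja, int r, int s)
{
    int lr = ia[r + 1] - ia[r], ls = ia[s + 1] - ia[s];
    return lr == ls && memcmp(ja + ia[r], ja + ia[s], lr * sizeof(int)) == 0;
}

/* Group rows with identical patterns, cf. Eq. (27): group[i] holds the
 * representative row of the group that row i belongs to.                   */
void checksum_grouping(int n, const int *ia, const int *ja, int *group)
{
    long *chk = malloc(n * sizeof(long));
    for (int i = 0; i < n; i++) {
        chk[i] = 0;
        for (int k = ia[i]; k < ia[i + 1]; k++)
            chk[i] += ja[k];               /* chk(u) = sum of neighbor indices */
        group[i] = i;                      /* each row starts in its own group */
    }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i; j++)
            if (group[j] == j && chk[i] == chk[j] && same_pattern(ia, ja, i, j)) {
                group[i] = j;              /* row i joins the group of row j   */
                break;
            }
    free(chk);
}
```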

On the other hand, sparse unstructured matrices may sometimes exhibit approximate dense blocks consisting mostly of non-zero entries, except for a few zeros inside the blocks. By treating these few zeros as non-zero elements, with a little sacrifice of memory, a block ordering may be generated for an iterative solver. Approximate dense blocks in a matrix may be computed by numbering consecutively rows and columns having a similar non-zero structure. However, this requires a checksum function that preserves the proximity of patterns, in the sense that close patterns should result in close checksum values. Unfortunately, this property does not hold true for Ashcraft's algorithm in its original form. In [49], Saad proposed to compare the angles of rows (or columns) to compute approximate dense structures in a matrix A. Let C be the pattern matrix of A, which by definition has the same pattern as A and non-zero values equal to 1. The method proposed by Saad computes the upper triangular part of CC^T: entry (i,j), for j > i, is the inner product between row i and row j of C and, after normalization by the row norms, gives the cosine of the angle between the two row patterns. A parameter τ is used to gauge the proximity of row patterns: if the cosine of the angle between rows i and j is larger than τ, row j is added to the group of row i. For τ = 1 the method computes perfectly dense blocks, while for τ < 1 it may compute larger blocks where some zero entries are padded in the pattern. To speed up the search, it may be convenient to run a first pass with the checksum algorithm to detect rows having an identical pattern and group them together; then, in a second pass, each non-assigned row is scanned again to determine whether it can be added to an existing group. Two important performance measures to gauge the quality of the computed block ordering are the average block density (av_bd), defined as the number of non-zeros in the matrix divided by the number of elements in the non-zero blocks, and the average block size (av_bs), which is the ratio between the sum of the dimensions of the square diagonal blocks and the number of diagonal blocks. The cost of Saad's method is close to that of checksum-based methods for cases in which a good blocking already exists, and in most cases it remains inferior to the cost of the least expensive block LU factorization, i.e. block ILU(0).
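
A minimal sketch of the cosine test used for this grouping is shown below, written for clarity rather than efficiency (the real implementation works on the upper triangular part of CC^T and uses the checksum pre-pass); all names are illustrative.

```c
#include <math.h>

/* Cosine of the angle between the patterns of rows r and s of a CSR matrix,
 * i.e. |pattern(r) intersect pattern(s)| / sqrt(nnz(r) * nnz(s)).
 * Column indices are assumed sorted within each row.                       */
static double pattern_cosine(const int *ia, const int *ja, int r, int s)
{
    int kr = ia[r], ks = ia[s], common = 0;
    while (kr < ia[r + 1] && ks < ia[s + 1]) {
        if      (ja[kr] < ja[ks]) kr++;
        else if (ja[kr] > ja[ks]) ks++;
        else { common++; kr++; ks++; }
    }
    int nr = ia[r + 1] - ia[r], ns = ia[s + 1] - ia[s];
    return common / sqrt((double)nr * (double)ns);
}

/* Angle-based grouping with threshold tau (tau = 1 recovers exact blocks).
 * group[i] = representative row of the group containing row i.
 * Quadratic loop kept for clarity; illustrative sketch only.               */
void cosine_grouping(int n, const int *ia, const int *ja,
                     double tau, int *group)
{
    for (int i = 0; i < n; i++) {
        group[i] = i;
        for (int j = 0; j < i; j++)
            if (group[j] == j && pattern_cosine(ia, ja, i, j) >= tau) {
                group[i] = j;
                break;
            }
    }
}
```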

Our recently developed variable-block variant of the ARMS method (VBARMS) incorporates an angle-based compression technique during the factorization to detect fine-grained dense structures in the linear system automatically, without any user's knowledge of the underlying problem, and exploits them to improve the overall robustness and throughput of the basic multilevel algorithm [50]. It is simpler to describe VBARMS from a graph point of view. Suppose we permute A in block form as

\tilde{A} \equiv P_B A P_B^T = \begin{pmatrix} \tilde{A}_{11} & \tilde{A}_{12} & \cdots & \tilde{A}_{1p} \\ \tilde{A}_{21} & \tilde{A}_{22} & \cdots & \tilde{A}_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{A}_{p1} & \tilde{A}_{p2} & \cdots & \tilde{A}_{pp} \end{pmatrix}, \qquad (28)

where the diagonal blocks Ã_{ii}, i = 1, …, p, are n_i×n_i and the off-diagonal blocks Ã_{ij} are n_i×n_j. We use upper case letters to denote matrix sub-blocks and lower case letters for individual matrix entries. We may represent the adjacency graph of Ã by the quotient graph of A + A^T [36]. Calling B the partition into blocks given by (28), we denote as G/B = (V_B, E_B) the quotient graph obtained by coalescing the vertices assigned to the block Ã_{ii} (for i = 1, …, p) into a supervertex Y_i. In other words, the entry in position (i,j) of Ã is a block of dimension |Y_i|×|Y_j|, where |X| is the cardinality of the set X. With this notation, the quotient graph G/B = (V_B, E_B) is defined as

V_B = \{ Y_1, \ldots, Y_p \}, \qquad E_B = \left\{ (Y_i, Y_j) : \exists\, v \in Y_i,\; w \in Y_j \;\text{s.t.}\; (v, w) \in E \right\}. \qquad (29)

An edge connects two supervertices Y_i and Y_j if there exists an edge from a vertex in A_{ii} to a vertex in A_{jj} in the graph (V, E) of A + A^T.

The complete pre-processing and factorization process of VBARMS consists of the following steps.

Step 1. Find the block ordering P_B of A such that, upon permutation, the matrix P_B A P_B^T has fairly dense non-zero blocks. We use the angle-based graph compression algorithm proposed by Saad and described earlier to compute exact or approximate block structures in A.

Step 2. Scale the matrix obtained at Step 1 in the form S_1 P_B A P_B^T S_2, using two diagonal matrices S_1 and S_2, so that the magnitude of the largest entry in each row and column is smaller than or equal to 1.

Step 3. Find the block independent set ordering P_I of the quotient graph G/B = (V_B, E_B). Apply the permutation to the matrix obtained at Step 2 as

P_I \left( S_1 P_B A P_B^T S_2 \right) P_I^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix}. \qquad (30)

We use a simple form of weighted greedy algorithm for computing the ordering P_I. The algorithm is the same as the one used in ARMS, and described in [25]. It consists of traversing the vertices of G/B in the natural order 1, 2, …, n, adding each visited supervertex that is not already marked to the independent set, and marking it together with all of its nearest neighbors connected to it by an edge. We assign to each supervertex Y a weight equal to the Frobenius norm of its diagonal block.

In the 2×2 partitioning (30), the upper left-most matrix D is block diagonal like in ARMS. However, due to the block permutation, the diagonal blocks of D are additionally block sparse matrices, as opposed to simply sparse matrices in ARMS and in other forms of multilevel incomplete LU factorizations, see [51, 52]. The matrices F, E, C are also block sparse for the same reason.

Step 4. Factorize the matrix (30) in the form

\begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ E U^{-1} & I \end{pmatrix} \times \begin{pmatrix} U & L^{-1} F \\ 0 & A_1 \end{pmatrix}, \qquad (31)

where I is the identity matrix of appropriate size, and form the reduced system with the Schur complement

A_1 = C - E D^{-1} F. \qquad (32)

The Schur complement is also block sparse and has the same block partitioning as C.

Steps 2–4 can be repeated on the reduced system a few times until the Schur complement is small enough. After one additional level, applying the same scaling and block independent set ordering to A_1, we obtain

P_I \left( D_1 A_1 D_2 \right) P_I^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix} \qquad (33)

that can be factored as

\begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ E U^{-1} & I \end{pmatrix} \times \begin{pmatrix} U & L^{-1} F \\ 0 & A_2 \end{pmatrix}. \qquad (34)

Denote as A_ℓ the reduced Schur complement matrix at level ℓ, for ℓ > 1. After scaling (by the diagonal matrices D_1 and D_2) and preordering A_ℓ, a system with the matrix

P_I \left( D_1 A_\ell D_2 \right) P_I^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ E U^{-1} & I \end{pmatrix} \times \begin{pmatrix} U & L^{-1} F \\ 0 & A_{\ell+1} \end{pmatrix} \qquad (35)

needs to be solved, with

A_{\ell+1} = C - E D^{-1} F. \qquad (36)

Calling

x = \begin{pmatrix} y \\ z \end{pmatrix}, \qquad b = \begin{pmatrix} f \\ g \end{pmatrix} \qquad (37)

the unknown solution vector and the right-hand side vector of system (35), the solution process with the above multilevel VBARMS factorization consists of level-by-level forward elimination followed by an exact solution on the last reduced system and suitable inverse permutation. The solving phase is sketched in Algorithm 1.

In VBARMS, we perform the factorization approximately, for memory efficiency. We use block ILU factorization with threshold to invert inexactly both the upper leftmost matrix D ≈ L̄Ū at each level ℓ, and the last-level Schur complement matrix A_{ℓ_max} ≈ L̄_S Ū_S. The block ILU method used in VBARMS is a straightforward block variant of the one-level pointwise ILUT algorithm. We drop small blocks B ∈ R^{m_B×n_B} in L̄, Ū, L̄_S, Ū_S whenever ‖B‖_F/(m_B · n_B) < t, for a given user-defined threshold t. The block pivots in the block ILU are inverted exactly by using GE with partial pivoting. In assembling the Schur complement matrix A_{ℓ+1} at level ℓ, we take advantage of the finest block structure of D, F, E, C imposed by the block ordering P_B on the small (usually dense) blocks in the diagonal blocks of D and on the corresponding small off-diagonal blocks in E and F; we call optimized level-3 BLAS routines [53] for computing A_{ℓ+1} in Eq. (36). We do not drop entries in the Schur complement, except at the last level. The same threshold is applied in all these operations.
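
The two block-level kernels mentioned above, the dropping test and the dense Schur complement update performed with level-3 BLAS, could look as follows for a single dense block. The sketch uses CBLAS/LAPACKE calls for illustration, while the actual VBARMS code operates on its own VBCSR data structures; the drop rule follows the criterion quoted above, and all names are illustrative.

```c
#include <math.h>
#include <cblas.h>
#include <lapacke.h>

/* Dropping rule for a dense m_b x n_b block B (row-major):
 * drop the block when ||B||_F / (m_b * n_b) < t.                           */
int drop_block(const double *B, int mb, int nb, double t)
{
    double fro = 0.0;
    for (int k = 0; k < mb * nb; k++) fro += B[k] * B[k];
    return sqrt(fro) / ((double)mb * (double)nb) < t;
}

/* One dense-block contribution to the Schur update of Eq. (36):
 * C_blk := C_blk - E_blk * D_blk^{-1} * F_blk, with D_blk of order nd,
 * E_blk of size nc x nd and F_blk of size nd x nc (all row-major).
 * D_blk is overwritten by its LU factors, F_blk by D_blk^{-1} * F_blk.     */
int schur_block_update(int nd, int nc, double *D_blk, lapack_int *ipiv,
                       double *E_blk, double *F_blk, double *C_blk)
{
    /* Factor D_blk = P*L*U and solve D_blk * X = F_blk (X stored in F_blk). */
    int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, nd, nd, D_blk, nd, ipiv);
    if (info != 0) return info;
    info = LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', nd, nc, D_blk, nd, ipiv,
                          F_blk, nc);
    if (info != 0) return info;
    /* C_blk := C_blk - E_blk * (D_blk^{-1} F_blk), a level-3 BLAS call.     */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nc, nc, nd, -1.0, E_blk, nd, F_blk, nc, 1.0, C_blk, nc);
    return 0;
}
```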

The VBARMS code is developed in the C language and is adapted from the existing ARMS code available in the ITSOL package [54]. The compressed sparse storage format of ARMS is modified to store block vectors and block matrices of variable size as a collection of contiguous non-zero dense blocks (we refer to this data storage format as VBCSR). First, we compute the factors L̄, Ū and L̄^{-1}F by performing a variant of the IKJ version of the Gaussian Elimination algorithm, where index I runs from 2 to m, index K from 1 to I−1, and index J from K+1 to n. This loop implicitly applies L̄^{-1} to the block row (D F) to produce (Ū L̄^{-1}F). In the second loop, Gaussian Elimination is performed on the block row (E C) using the multipliers computed in the first loop, giving EŪ^{-1} and an approximation of the Schur complement A_{ℓ+1}. Then, after Step 1, we permute explicitly the matrix at the first level, as well as the matrices involved in the factorization at each new reordering step. For extensive performance assessment results of the VBARMS method, we point the reader to [50].

Algorithm 1 VBARMS_Solve(A_ℓ, b). The solving phase with the VBARMS method.

Require: the current level ℓ, the last level ℓ_max, the right-hand side b = (f, g)^T

1: Solve L_ℓ y = f

2: Compute g := g − (E_ℓ U_ℓ^{-1}) y

3: if ℓ = ℓ_max then

4:  Solve A_{ℓ+1} z = g

5: else

6:  Call VBARMS_Solve(A_{ℓ+1}, g)

7: end if

8: Solve U_ℓ y = y − (L_ℓ^{-1} F_ℓ) z
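
To make the recursion of Algorithm 1 concrete, the sketch below implements the level-by-level solve for dense level data, assuming each level stores the factors L, U of D together with the coupling products G = EU^{-1} and W = L^{-1}F, and that the last level provides an exact solve for the final Schur complement. The per-level permutations and scalings are omitted, and the data layout and names are illustrative, much simpler than the block sparse VBCSR structures used in the actual code.

```c
#include <stddef.h>

/* Per-level data for the solve of Algorithm 1 (dense, row-major storage).
 * nd: size of the D block, ns: size of the Schur complement at this level. */
typedef struct Level {
    int nd, ns;
    const double *L, *U;     /* nd x nd unit-lower / upper factors of D     */
    const double *G;         /* ns x nd block, G = E U^{-1}                 */
    const double *W;         /* nd x ns block, W = L^{-1} F                 */
    struct Level *next;      /* next level, or NULL at the last level       */
    void (*last_solve)(const struct Level *, double *rhs); /* exact solve   */
} Level;

static void lower_solve(int n, const double *L, double *x)  /* L x = b      */
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i; j++) x[i] -= L[i * n + j] * x[j];
}

static void upper_solve(int n, const double *U, double *x)  /* U x = b      */
{
    for (int i = n - 1; i >= 0; i--) {
        for (int j = i + 1; j < n; j++) x[i] -= U[i * n + j] * x[j];
        x[i] /= U[i * n + i];
    }
}

/* Level-by-level solve; b = (f, g) of length nd + ns, overwritten by (y, z). */
void vbarms_solve(const Level *lev, double *b)
{
    double *f = b, *g = b + lev->nd;

    lower_solve(lev->nd, lev->L, f);                 /* 1: solve L y = f     */
    for (int i = 0; i < lev->ns; i++)                /* 2: g := g - G y      */
        for (int j = 0; j < lev->nd; j++)
            g[i] -= lev->G[i * lev->nd + j] * f[j];

    if (lev->next == NULL)                           /* 3-7: last-level      */
        lev->last_solve(lev, g);                     /*      solve or recurse*/
    else
        vbarms_solve(lev->next, g);

    for (int i = 0; i < lev->nd; i++)                /* 8: y := U^{-1}(y-Wz) */
        for (int j = 0; j < lev->ns; j++)
            f[i] -= lev->W[i * lev->ns + j] * g[j];
    upper_solve(lev->nd, lev->U, f);
}
```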

5. Numerical experiments

In this section, we illustrate the performance of the VBARMS method for solving a suite of block structured linear systems arising from an implicit Newton-Krylov formulation of the RANS equations in the turbulent incompressible flow analysis past a three-dimensional wing. On multicore machines, the quotient graph G/B is split into distinct subdomains, and each of them is assigned to a different core. Following the parallel framework described in [55], we separate the nodes assigned to the ith subdomain into interior nodes, which are coupled by the equations only with local variables, and interface nodes, which may be coupled with the local variables stored on processor i as well as with remote variables stored on other processors (see the figure below).

The vector of the local unknowns x_i and the local right-hand side b_i are split accordingly into two separate components: the subvector corresponding to the internal nodes followed by the subvector of the local interface variables

x_i = \begin{pmatrix} u_i \\ y_i \end{pmatrix}, \qquad b_i = \begin{pmatrix} f_i \\ g_i \end{pmatrix}. \qquad (38)

The rows of A indexed by the nodes of the ith subdomain are assigned to the ith processor. These are naturally separated into a local matrix A_i acting on the local variables x_i = (u_i, y_i)^T, and an interface matrix U_{i,ext} acting on the remotely stored subvectors of the external interface variables y_{i,ext}. Hence, we can write the local equations on processor i as

A_i x_i + U_{i,ext}\, y_{i,ext} = b_i \qquad (39)

or, in expanded form, as

\begin{pmatrix} B_i & F_i \\ E_i & C_i \end{pmatrix} \begin{pmatrix} u_i \\ y_i \end{pmatrix} + \begin{pmatrix} 0 \\ \sum_{j \in N_i} E_{ij}\, y_j \end{pmatrix} = \begin{pmatrix} f_i \\ g_i \end{pmatrix}, \qquad (40)

where N_i is the set of subdomains that are neighbors to subdomain i and the term E_{ij} y_j accounts for the contribution to the local equations from the jth neighboring subdomain. Note that the matrices B_i, C_i, E_i, and F_i still preserve the fine block structure imposed by the block ordering P_B. From a code viewpoint, the quotient graph is initially distributed amongst the available processors; then, the built-in parallel hypergraph partitioner available in the Zoltan package [56] is applied to the distributed data structure to compute an optimal partitioning of the quotient graph that minimizes the amount of communication.

At this stage, the VBARMS method described in Section 4.2 can be used as a local solver for different types of global preconditioners. In the simplest parallel implementation, the so-called block-Jacobi preconditioner, the sequential VBARMS method is applied to invert approximately each local matrix A_i. The standard Jacobi iteration for solving Ax = b is defined as

x^{n+1} = x^n + D^{-1} \left( b - A x^n \right) = D^{-1} \left( N x^n + b \right), \qquad (41)

where D is the diagonal of A, N = D − A, and x^0 is some initial approximation. In the case of a graph-partitioned matrix, the matrix D is block diagonal and its diagonal blocks are the local matrices A_i. The interest of the block Jacobi preconditioner lies in its inherent parallelism, since the solves with the matrices A_i are performed independently on all the processors and no communication is required.
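
In preconditioner form, one application of block Jacobi amounts to an independent approximate solve with each local matrix A_i. A minimal sketch is given below as a serial loop over the subdomains, with a hypothetical local_solve() standing in for the VBARMS solve with each A_i; in the parallel code each block lives on its own processor.

```c
/* One application of the block-Jacobi preconditioner: z = M^{-1} r.
 * The vector is partitioned into nparts contiguous blocks; offset[i] is the
 * start of block i and offset[nparts] the total length.  local_solve() is a
 * hypothetical routine applying the (approximate) inverse of the local
 * matrix A_i.  Illustrative sketch only.                                    */
void block_jacobi_apply(int nparts, const int *offset,
                        void (*local_solve)(int part, const double *rhs,
                                            double *sol, int n),
                        const double *r, double *z)
{
    for (int i = 0; i < nparts; i++) {
        int n = offset[i + 1] - offset[i];
        /* independent local solve: z_i = A_i^{-1} r_i (no communication)   */
        local_solve(i, r + offset[i], z + offset[i], n);
    }
}
```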

If the diagonal blocks of the matrix D in the block-Jacobi method are enlarged so that they overlap slightly, the resulting preconditioner is called a Schwarz preconditioner. Consider again a graph-partitioned matrix with N nonoverlapping sets W_i^0, i = 1, …, N, and W^0 = ∪_{i=1}^N W_i^0. We define a δ-overlap partition

W^\delta = \bigcup_{i=1}^{N} W_i^\delta \qquad (42)

where W_i^δ = adj(W_i^{δ−1}) and δ > 0 is the level of overlap with the neighboring domains. For each subdomain, we define a restriction operator R_i^δ, which is an n×n matrix whose (j,j)th element is equal to 1 if j ∈ W_i^δ, and zero elsewhere. We then denote

A_i = R_i^\delta\, A\, R_i^\delta. \qquad (43)

The global preconditioning matrix M_{RAS} is defined as

M_{RAS}^{-1} = \sum_{i=1}^{N} R_i^T A_i^{-1} R_i \qquad (44)

and is named the Restricted Additive Schwarz (RAS) preconditioner [3, 57]. Note that the preconditioning step still offers good scope for parallelism, as the different components of the error update are formed independently. However, due to the overlapping, some communication is required in the final update, as the components are added up from each subdomain. In our experiments, the overlap used for RAS was the level-1 neighbors of the local nodes in the quotient graph.

A third global preconditioner that we consider in this study is based on the Schur complement approach. In Eq. (40), we can eliminate the vector of interior unknowns u_i from the first equations to compute the local Schur complement system

S_i y_i + \sum_{j \in N_i} E_{ij}\, y_j = g_i - E_i B_i^{-1} f_i \equiv g_i', \qquad (45)

where S_i denotes the local Schur complement matrix

S_i = C_i - E_i B_i^{-1} F_i. \qquad (46)

The local Schur complement equations considered altogether write as the global Schur complement system

\begin{pmatrix} S_1 & E_{12} & \cdots & E_{1p} \\ E_{21} & S_2 & \cdots & E_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ E_{p1} & E_{p2} & \cdots & S_p \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix} = \begin{pmatrix} g_1' \\ g_2' \\ \vdots \\ g_p' \end{pmatrix}, \qquad (47)

where the off-diagonal matrices E_{ij} are available from the parallel distribution of the linear system. One preconditioning step with the Schur complement preconditioner consists in solving approximately the global system (47), and then recovering the u_i variables from the local equations as

u_i = B_i^{-1} \left( f_i - F_i\, y_i \right) \qquad (48)

at the cost of one local solve. We solve the global system (47) by running a few steps of the GMRES method preconditioned by a block diagonal matrix, where the diagonal blocks are the local Schur complements Si. The factorization

S_i = L_{S_i} U_{S_i} \qquad (49)

is obtained as a by-product of the LU factorization of the local matrix A_i,

A_i = \begin{pmatrix} L_{B_i} & 0 \\ E_i U_{B_i}^{-1} & L_{S_i} \end{pmatrix} \begin{pmatrix} U_{B_i} & L_{B_i}^{-1} F_i \\ 0 & U_{S_i} \end{pmatrix}, \qquad (50)

which is in any case required to compute the u_i variables in Eq. (48).

5.1. Results

The parallel experiments were run on the large-memory nodes (32 cores/node and 1 TB of memory) of the TACC Stampede system located at the University of Texas at Austin. TACC Stampede is a 10 PFLOPS (PF) Dell Linux cluster based on 6400+ Dell PowerEdge server nodes, each outfitted with 2 Intel Xeon E5 (Sandy Bridge) processors and an Intel Xeon Phi coprocessor (MIC architecture). We linked the default vendor BLAS library, which is MKL. Although MKL is multi-threaded by default, in our runs we used it in single-threaded mode, since our MPI-based parallelisation employed one MPI process per core (communicating via shared memory for same-node cores). We used the Flexible GMRES (FGMRES) method [58] as the Krylov subspace method, with a tolerance of 1.0e−6 in the stopping criterion and a maximum number of iterations equal to 1000. Memory costs were calculated as the ratio between the sum of the numbers of non-zeros in the local preconditioners and the sum of the numbers of non-zeros in the local matrices A_i.

In our experiments, we analyzed the turbulent incompressible flow past the three-dimensional wing illustrated in Figure 2 using the EulFS code developed by the second author [59]. The geometry, called DPW3 Wing-1, was proposed in the 3rd AIAA Drag Prediction Workshop [35]. The flow conditions are 0.5° angle of attack and a Reynolds number, based on the reference chord, equal to 5×10^6. The freestream turbulent viscosity is set to 10% of its laminar value. In Tables 1 and 2 we show experiments with the parallel VBARMS solver on the five meshes of the DPW3 Wing-1 problem. On the largest mesh we report only one experiment, in Table 2, as this is a resource-demanding problem. In Table 3 we report a strong scalability study on the problem denoted as RANS2, obtained by increasing the number of processors. Finally, in Table 4 we show comparative results of parallel VBARMS against other popular solvers; the method denoted as pARMS is the solver described in [55] using default parameters, while the method VBILUT is a variable-block incomplete lower-upper factorization with threshold from the ITSOL package [54]. The results of our experiments show that the proposed preconditioner is effective in reducing the number of iterations, especially in combination with the Restricted Additive Schwarz method, and exhibits good parallel scalability. A truly parallel implementation of the VBARMS method that may offer better numerical scalability will be considered as the next step of this research.

Figure 2.

Geometry and mesh characteristics of the DPW3 Wing-1 problem proposed in the 3rd AIAA drag prediction workshop. Note that problems RANS1 and RANS2 correspond to the same mesh, and are generated at two different Newton steps.

Matrix | Method | Graph time (s) | Factorization time (s) | Solving time (s) | Total time (s) | Its | Mem
RANS1 | BJ + VBARMS | 17.3 | 8.58 | 41.54 | 50.13 | 34 | 2.98
RANS1 | RAS + VBARMS | 17.4 | 10.08 | 42.28 | 52.37 | 19 | 3.06
RANS1 | SCHUR + VBARMS | 17.6 | 11.94 | 55.99 | 67.93 | 35 | 2.57
RANS2 | BJ + VBARMS | 17.0 | 16.72 | 70.14 | 86.86 | 47 | 4.35
RANS2 | RAS + VBARMS | 16.8 | 21.65 | 80.24 | 101.89 | 39 | 4.49
RANS2 | SCHUR + VBARMS | 17.5 | 168.85 | 173.54 | 342.39 | 24 | 6.47
RANS3 | BJ + VBARMS | 27.2 | 99.41 | 187.95 | 287.36 | 154 | 4.40
RANS3 | RAS + VBARMS | 25.2 | 119.32 | 90.47 | 209.79 | 71 | 4.48
RANS3 | SCHUR + VBARMS | 22.0 | 52.65 | 721.67 | 774.31 | 140 | 4.39

Table 1.

Experiments on the DPW3 Wing-1 problem. The RANS1, RANS2 and RANS3 test cases are solved on 32 processors. We ran one MPI process per core, so in these experiments we used shared memory on a single node.

Matrix | Method | Graph time (s) | Factorization time (s) | Solving time (s) | Total time (s) | Its | Mem
RANS4 | BJ + VBARMS | 51.5 | 12.05 | 105.89 | 117.94 | 223 | 3.91
RANS4 | RAS + VBARMS | 43.9 | 14.05 | 91.53 | 105.58 | 143 | 4.12
RANS4 | SCHUR + VBARMS | 39.3 | 15.14 | 289.89 | 305.03 | 179 | 3.76
RANS5 | RAS + VBARMS | 1203.94 (1) | 16.80 | 274.62 | 291.42 | 235 | 4.05

Table 2.

Experiments on the DPW3 Wing-1 problem. The RANS4 and RANS5 test cases are solved on 128 processors. Note (1): due to a persistent problem with the Zoltan library on this run, we report on the result of our experiment with the metis (sequential) graph partitioner [60].

Solver | Number of processors | Graph time (s) | Total time (s) | Its | Mem
RAS + VBARMS | 8 | 38.9 | 388.37 | 27 | 5.70
RAS + VBARMS | 16 | 28.0 | 219.48 | 35 | 5.22
RAS + VBARMS | 32 | 17.0 | 101.49 | 39 | 4.49
RAS + VBARMS | 64 | 16.0 | 54.19 | 47 | 3.91
RAS + VBARMS | 128 | 18.2 | 28.59 | 55 | 3.39

Table 3.

Strong scalability study on the RANS2 problem using parallel graph partitioning.

Matrix | Method | Factorization time (s) | Solving time (s) | Total time (s) | Its | Mem
RANS3 | pARMS | — | — | — | — | 6.63
RANS3 | BJ + VBARMS | 99.41 | 187.95 | 287.36 | 154 | 4.40
RANS3 | BJ + VBILUT | 20.45 | 8997.82 | 9018.27 | 979 | 13.81
RANS4 | pARMS | — | — | — | — | 5.38
RANS4 | BJ + VBARMS | 12.05 | 105.89 | 117.94 | 223 | 3.91
RANS4 | BJ + VBILUT | 1.16 | 295.20 | 296.35 | 472 | 5.26

Table 4.

Experiments on the DPW3 Wing-1 problem. The RANS3 test case is solved on 32 processors and the RANS4 problem on 128 processors. The dash symbol in the table means that in the GMRES iteration the residual norm is very large and the program is aborted.

6. Conclusions

The applicability of Newton's method in steady flow simulations is often limited by the difficulty of computing a good initial solution, namely, one lying in a reasonably small neighborhood of the sought solution. This problem can be overcome by introducing some approximations in the first stages of the solution procedure. In the case of unsteady flow problems, on the other hand, the use of Newton's method in conjunction with a dual time-stepping procedure is even more effective, since the flow field computed at the preceding physical time level is likely to be sufficiently close to the sought solution at the next time level to allow the use of Newton's algorithm right from the beginning of the sub-iterations in pseudo-time. The downside of Newton-Krylov methods is the need for efficiently preconditioned iterative algorithms to solve the sparse linear system arising at each inner iteration (Newton step). The stiffness of the linear systems to be solved increases when the Jacobian is computed "exactly" and the turbulence transport equations are solved fully coupled with the mean flow equations.

In this chapter, we have presented a block multilevel incomplete factorization preconditioner for solving sparse systems of linear equations arising from the implicit RANS formulation. The method detects automatically any existing block structure in the matrix, without any user's prior knowledge of the underlying problem, and exploits it to maximize computational efficiency. The results of this chapter show that, by taking advantage of this block structure, the solver can be made more robust and efficient. Other recent studies on block ILU preconditioners have drawn similar conclusions on the importance of exposing dense blocks during the construction of the incomplete LU factorization for better performance, in the design of incomplete multifrontal LU-factorization preconditioners [61] and of adaptive blocking approaches for blocked incomplete Cholesky factorization [62]. We believe that the proposed VBARMS method can also be useful for solving linear systems in other areas, such as electromagnetics applications [63, 64, 65].

Acknowledgments

The authors acknowledge the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for providing HPC resources that have contributed to the research results reported in this chapter (URL: http://www.tacc.utexas.edu). The authors are grateful to Prof. Masha Sosonkina at the Department of Modeling, Simulation & Visualization Engineering, Old Dominion University, for help and insightful comments.

Nomenclature

a: artificial sound speed
d: dimension of the space, d = 2, 3
e_0: specific total energy
h_0: specific total enthalpy
ℓ: level of fill in incomplete lower-upper factorizations
m: number of degrees of freedom within a gridpoint
n: order of a matrix
nnz: number of non-zero entries in a sparse matrix
n: unit inward normal to the control surface
p: static pressure
q: flux vector due to heat conduction
t: time
u: velocity vector
x: vector of the d Cartesian coordinates
C_i: median dual cell (control volume)
∂C_i: boundary of the median dual cell (control surface)
E: edges of a graph
G(A): graph of matrix A
M: global mass matrix
M_L: left preconditioning matrix
Ma: Mach number
I: identity matrix
N_i: shape function
P: permutation matrix
Pr_T: turbulent Prandtl number
R_Φ: spatial residual vector
Re: Reynolds number
T: triangle or tetrahedron
U: conserved variables vector
V: vertices of a graph
V_M: lumped mass matrix
Z: parameter vector
α: angle of attack
Δt: physical time step
Δτ: pseudo-time step
ΔU = U^{n+1,k+1} − U^{n+1,k}
ε_mc: machine zero
ρ: density
ν: kinematic viscosity
ν̃: working variable in the turbulence transport equation
τ: pseudo-time variable
τ̄̄: Newtonian stress tensor
Φ: flux balance
Ω_i^e: Petrov-Galerkin weighting function
i: nodal index or row index of a matrix
j: nodal index or column index of a matrix
e: cell index
∞: free-stream condition
inv: inviscid
k: inner iteration counter
n: physical time step counter
T: transpose
vis: viscous
