The previous discussions have pointed out that the solution of the large nonsymmetric sparse linear system (15) at each pseudo-time step is a major computational task of the whole flow simulation, especially when the mean flow and turbulence transport equations are solved in fully coupled form, the Jacobian is computed *exactly* by means of FD, and the size of the time step is rapidly increased to recover Newton's algorithm. For convenience, we write system (15) in compact form as

$$A x = b.$$

### 4.1. Multi-elimination ILU factorization preconditioner

Incomplete LU factorization methods (ILUs) are an effective, yet simple, class of preconditioning techniques for solving large linear systems. They are written in the form $M=\bar{L}\bar{U}$, where $\bar{L}$ and $\bar{U}$ are approximations of the $L$ and $U$ factors of the standard triangular LU decomposition of $A$. The incomplete factorization may be computed directly from the Gaussian elimination (GE) algorithm, by discarding some entries in the $L$ and $U$ factors according to various strategies; see [3]. A stable ILU factorization is proved to exist for arbitrary choices of the sparsity pattern of $\bar{L}$ and $\bar{U}$ only for particular classes of matrices, such as M-matrices [23] and H-matrices with positive diagonal entries [24]. However, many techniques can help improve the quality of the preconditioner on more general problems, such as reordering, scaling, diagonal shifting, pivoting and condition estimators [25, 26, 27, 28]. As a result of these developments, in the past decade successful experiences have been reported using ILU preconditioners in areas that were the exclusive domain of direct solution methods, such as circuit simulation, power system networks, chemical engineering plant modeling, graphs and other problems not governed by PDEs, or in areas where direct methods have been traditionally preferred, such as structural analysis, semiconductor device modeling, and computational fluid dynamics (see [29, 30, 31, 32, 33]).
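As an illustration, the following sketch builds an ILU-preconditioned GMRES solve of the kind discussed above, assuming SciPy's sparse toolbox; the test matrix (a skewed five-point stencil) and the drop parameters are made up for the example.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A small nonsymmetric sparse test matrix (skewed 2D five-point stencil).
n = 20                                   # grid points per direction
N = n * n
A = sp.diags(
    [4.0 * np.ones(N), -np.ones(N - 1), -np.ones(N - 1),
     -1.3 * np.ones(N - n), -0.7 * np.ones(N - n)],
    [0, -1, 1, -n, n]).tocsc()
b = np.ones(N)

# Incomplete LU factorization M = L~ U~ of A: entries below drop_tol are
# discarded and the fill-in is bounded by fill_factor.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator(A.shape, ilu.solve)   # action of M^{-1} on a vector

# Preconditioned GMRES; info == 0 signals convergence.
x, info = spla.gmres(A, b, M=M)
```

Tightening `drop_tol` makes $M$ closer to $A$ (fewer GMRES iterations, more memory), which is the trade-off ILU preconditioners expose.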

Multi-elimination ILU factorization is a powerful class of ILU preconditioners, which combines the simplicity of ILU techniques with the robustness and high degree of parallelism of domain decomposition methods [34]. It is built on the idea that, due to sparsity, many unknowns of a linear system are not coupled by an equation (i.e., they are *independent*) and thus can be eliminated simultaneously at a given stage of GE. If the, say, $m$ independent unknowns are numbered first, and the other $n-m$ unknowns last, the coefficient matrix of the system is permuted into a $2\times 2$ block structure of the form

$$P A P^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix}, \tag{24}$$

where $D$ is a diagonal matrix of dimension $m$ and $C$ is a square matrix of dimension $n-m$. In multi-elimination methods, a reduced system is recursively constructed from (24) by computing a block LU factorization of $P A P^T$ of the form

$$\begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ G & I_{n-m} \end{pmatrix} \times \begin{pmatrix} U & W \\ 0 & A_1 \end{pmatrix}, \tag{25}$$

where $L$ and $U$ are the triangular factors of the LU factorization of $D$, $A_1 = C - ED^{-1}F$ is the Schur complement with respect to $C$, $I_{n-m}$ is the identity matrix of dimension $n-m$, and we denote $G = EU^{-1}$ and $W = L^{-1}F$. The reduction process can be applied again to the reduced system with $A_1$, and recursively to each consecutively reduced system, until the Schur complement is small enough to be solved with a standard method such as a dense LAPACK solver [35]. Multi-elimination ILU factorization preconditioners may be obtained from the decomposition (25) by performing the reduction process inexactly, dropping small entries in the Schur complement matrix and/or factorizing $D$ approximately at each reduction step. These preconditioners exhibit better parallelism than conventional ILU algorithms, due to the recursive factorization. Additionally, for comparable memory usage, they may be significantly more robust, especially for solving large problems, as the reduced system is typically small and better conditioned than the full system.
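A small dense sketch (random data and hypothetical sizes, chosen only for illustration) may help verify the factorization (25): with $D$ diagonal one can take $L=I$ and $U=D$, and the two block triangular factors reproduce the permuted matrix exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 4, 3                      # m independent unknowns, k = n - m remaining
D = np.diag(rng.uniform(1.0, 2.0, m))   # independent set -> diagonal D
F = rng.standard_normal((m, k))
E = rng.standard_normal((k, m))
C = rng.standard_normal((k, k)) + 5.0 * np.eye(k)

A = np.block([[D, F], [E, C]])   # the permuted matrix of Eq. (24)

# Block LU of Eq. (25): D = L U with L = I, U = D (D is diagonal),
# G = E U^{-1}, W = L^{-1} F, and A1 the Schur complement w.r.t. C.
L, U = np.eye(m), D
G = E @ np.linalg.inv(U)
W = np.linalg.solve(L, F)
A1 = C - E @ np.linalg.inv(D) @ F

lower = np.block([[L, np.zeros((m, k))], [G, np.eye(k)]])
upper = np.block([[U, W], [np.zeros((k, m)), A1]])
err = np.linalg.norm(lower @ upper - A)   # should vanish up to round-off
```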

The factorization (25) defines a general framework which may accommodate many different methods. An important distinction between the various methods is rooted in the choice of the algorithm used to discover sets of independent unknowns. Many of these algorithms are borrowed from graph theory, where such sets are referred to as *independent sets*. Denoting by $G=(V,E)$ the adjacency graph of $A$, where $V=\{v_1, v_2, \ldots, v_n\}$ is the set of vertices and $E$ the set of edges, a *vertex independent set* $S$ is defined as a subset of $V$ such that

$$\forall v_i \in S,\ \forall v_j \in S:\ (v_i, v_j) \notin E. \tag{26}$$

The set $S$ is maximal if there is no other independent set that strictly contains $S$ [36]. Independent sets in a graph may be computed by simple greedy algorithms which traverse the vertices in the natural order $1, 2, \ldots, n$, mark each visited vertex $v$ and all of its nearest neighbors connected to $v$ by an edge, and add $v$ and each visited node that is not already marked to the independent set [37]. As an alternative to the greedy algorithm, the nested dissection ordering [38], mesh partitioning, or further information from the set of nested finite element grids of the underlying problem can be used [39, 40, 41].
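The greedy sweep just described can be sketched as follows; this is a minimal pointwise version, and the adjacency lists are a made-up example.

```python
def greedy_independent_set(adj):
    """Greedy search for a maximal independent set.

    adj: adjacency lists of the graph of A (list of lists of neighbours).
    Vertices are traversed in the natural order; an unmarked vertex joins
    the set, and it and all its neighbours are then marked.
    """
    n = len(adj)
    marked = [False] * n
    indep = []
    for v in range(n):
        if not marked[v]:
            indep.append(v)
            marked[v] = True
            for w in adj[v]:
                marked[w] = True   # neighbours are excluded from the set
    return indep

# Path graph 0-1-2-3-4: the sweep picks vertices 0, 2 and 4.
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
result = greedy_independent_set(adj)
```

The returned set is maximal but not necessarily maximum; for the purpose of multi-elimination, any sufficiently large independent set is adequate.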

The multilevel preconditioner considered in our study is the Algebraic Recursive Multilevel Solver (ARMS) introduced in [25], which uses block independent sets computed by the simple greedy algorithm. Block independent sets are characterized by the property that unknowns of two different sets have no coupling, while unknowns within the same set may be coupled. In this case, the matrix $D$ appearing in (24) is block diagonal, and may typically consist of large diagonal blocks that are factorized by an ILU factorization with threshold (ILUT [42]) for memory efficiency. In the ARMS implementation described in [25], first the incomplete triangular factors $\bar{L}$, $\bar{U}$ of $D$ are computed by one sweep of ILUT, and an approximation $\bar{W}$ to $\bar{L}^{-1}F$ is also computed. In a second loop, an approximation $\bar{G}$ to $E\bar{U}^{-1}$ and an approximate Schur complement matrix $\bar{A}_1$ are derived. This procedure is repeated at each reduction level. At the last level, another sweep of ILUT is applied to the (last) reduced system. The blocks $\bar{W}$ and $\bar{G}$ are stored temporarily, and discarded from the data structure after the Schur complement matrix is computed. Only the incomplete factors of $D$ at each level, those of the last-level Schur matrix, and the permutation arrays are needed for the solving phase. With this implementation, dropping can be performed separately in the matrices $\bar{L}$, $\bar{U}$, $\bar{W}$, $\bar{G}$, $\bar{A}_1$. This in turn allows factoring $D$ accurately without incurring additional costs in $\bar{G}$ and $\bar{W}$, achieving high computational and memory efficiency. Implementation details and careful selection of the parameters are always critical aspects to consider in the design of sparse matrix algorithms. Next, we show how to combine the ARMS method with matrix compression techniques to exploit the block structure of $A$ for better efficiency.

### 4.2. The variable-block ARMS factorization

The discretization of the Navier-Stokes equations for turbulent compressible flows assigns five distinct variables to each grid point (density, scaled energy, two components of the scaled velocity, and a turbulence transport variable); these reduce to four for incompressible, constant-density flows, and to three if, additionally, the flow is laminar. If the, say, $\ell$ distinct variables associated with the same node are numbered consecutively, the permuted matrix has a sparse block structure with non-zero blocks of size $\ell\times\ell$. The blocks are usually fully dense, as variables at the same node are mutually coupled. Exploiting any available block structure in the preconditioner design may bring several benefits [43], some of which are explained below:

*Memory.* A clear advantage is to store the matrix as a collection of blocks using the variable-block compressed sparse row (VBCSR) format, saving column indices and pointers for the block entries.

*Stability.* On indefinite problems, computing with blocks instead of single elements enables a better control of pivot breakdowns, near singularities, and other possible sources of numerical instabilities. Block ILU solvers may be used instead of pointwise ILU methods.

*Complexity.* By grouping variables into clusters, the Schur complement is smaller, and hopefully the last reduced system is better conditioned and easier to solve.

*Efficiency.* A full block implementation, based on higher-level optimized BLAS routines as computational kernels, may be designed, leading to better flop-to-memory ratios on modern cache-based computer architectures.

*Cache effects.* Better cache reuse is possible for block algorithms.
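To illustrate the memory benefit, the following sketch (assuming SciPy; the sizes are arbitrary) assembles a matrix with dense $\ell\times\ell$ blocks and stores it in a block compressed row format. SciPy's BSR format handles only a fixed block size, so it is a simplified stand-in for the variable-block VBCSR format mentioned above.

```python
import numpy as np
import scipy.sparse as sp

ell = 4                      # unknowns per grid point
npts = 6                     # grid points
rng = np.random.default_rng(1)

# Assemble a block tridiagonal matrix point by point: each nonzero entry is
# a fully dense ell x ell block coupling the variables of two grid points.
A = sp.lil_matrix((npts * ell, npts * ell))
for i in range(npts):
    for j in (i - 1, i, i + 1):
        if 0 <= j < npts:
            A[i*ell:(i+1)*ell, j*ell:(j+1)*ell] = rng.standard_normal((ell, ell))

# BSR stores one column index and one pointer per *block* instead of per
# scalar entry, mirroring the index-memory saving of a block format.
A_bsr = sp.bsr_matrix(A.tocsr(), blocksize=(ell, ell))
nblocks = A_bsr.indices.size          # stored dense blocks (one index each)
scalar_nnz = A.tocsr().nnz            # scalar CSR stores one index per entry
```

Here the block format keeps 16 column indices instead of 256, a factor $\ell^2$ reduction in index overhead.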

It has been demonstrated that block iterative methods often exhibit a faster convergence rate than their pointwise analogues for the solution of many classes of two- and three-dimensional partial differential equations (PDEs) [44, 45, 46]. For this reason, in the case of the simple Poisson equation with Dirichlet boundary conditions on a rectangle $(0,\ell_1)\times(0,\ell_2)$, discretized uniformly with $n_1+2$ points in the interval $(0,\ell_1)$ and $n_2+2$ points in $(0,\ell_2)$, it is often convenient to number the interior points by lines from the bottom up in the natural ordering, so that one obtains an $n_2\times n_2$ block tridiagonal matrix with square blocks of size $n_1\times n_1$; the diagonal blocks are tridiagonal matrices and the off-diagonal blocks are diagonal matrices. For large finite element discretizations, it is common to use substructuring, where each substructure of the physical mesh corresponds to one sparse block of the system. If the domain is highly irregular or the matrix does not correspond to a differential equation, finding the best block partitioning is much less obvious. In this case, graph reordering techniques are worth considering.

The PArameterized BLock Ordering (PABLO) method proposed by O'Neil and Szyld is one of the first block reordering algorithms for sparse matrices [47]. The algorithm selects groups of nodes in the adjacency graph of the coefficient matrix such that the corresponding diagonal blocks are either full or very dense. It has been shown that classical block stationary iterative methods, such as the block Gauss-Seidel and SOR methods, combined with the PABLO ordering require fewer operations than their point analogues for the finite element discretization of a Dirichlet problem on a graded L-shaped region, as well as for the 9-point discretization of the Laplacian operator on a square grid. Both the time and space complexity of the PABLO algorithm are proportional to the number of nodes and edges of the graph.

Another useful approach to computing dense blocks in the sparsity pattern of a matrix $A$ is the method proposed by Ashcraft in [48]. The algorithm searches for sets of rows or columns having exactly the same pattern. From a graph viewpoint, it looks for vertices of the adjacency graph $(V,E)$ of $A$ having the same adjacency list; these are also called *indistinguishable nodes* or *cliques*. The algorithm assigns a *checksum* quantity to each vertex, using the function

$$\mathrm{chk}(u) = \sum_{(u,w)\in E} w, \tag{27}$$

and then sorts the vertices by their checksums. This operation takes $|E| + |V|\log|V|$ time. If $u$ and $v$ are indistinguishable, then $\mathrm{chk}(u)=\mathrm{chk}(v)$. Therefore, the algorithm examines nodes having the same checksum to see whether they are indistinguishable. The ideal checksum function would assign a different value to each distinct row pattern, but it is not practical because it may quickly lead to huge numbers that may not even be machine-representable. Since the time cost of Ashcraft's method is generally negligible relative to the time it takes to solve the system, simple checksum functions such as (27) are used in practice [48].
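A toy sketch of the checksum idea follows; the row patterns are made up, and the checksum is the sum of the column indices in each row, as in Eq. (27).

```python
from collections import defaultdict

def identical_pattern_groups(patterns):
    """Group rows with identical sparsity patterns (Ashcraft's idea).

    patterns: list of sorted tuples of column indices, one per row.
    The cheap checksum of Eq. (27) buckets candidate rows; only rows
    sharing a checksum are then compared entry by entry.
    """
    buckets = defaultdict(list)
    for i, p in enumerate(patterns):
        buckets[sum(p)].append(i)          # chk(u) = sum of adjacent indices
    groups = []
    for rows in buckets.values():
        by_pattern = defaultdict(list)
        for i in rows:
            by_pattern[patterns[i]].append(i)   # resolve checksum collisions
        groups.extend(by_pattern.values())
    return groups

# Rows 0 and 2 are indistinguishable; row 3 collides on the checksum (7)
# but has a different pattern, so the second pass separates it.
patterns = [(0, 2, 5), (1, 3), (0, 2, 5), (0, 3, 4), (1, 3)]
groups = identical_pattern_groups(patterns)
```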

On the other hand, sparse unstructured matrices may sometimes exhibit *approximate dense blocks* consisting mostly of non-zero entries, except for a few zeros inside the blocks. By treating these few zeros as non-zero elements, at a small cost in memory, a block ordering may be generated for an iterative solver. Approximate dense blocks in a matrix may be computed by numbering consecutively rows and columns having a similar non-zero structure. However, this requires a checksum function that preserves the proximity of patterns, in the sense that close patterns result in close checksum values; unfortunately, this property does not hold for Ashcraft's algorithm in its original form. In [49], Saad proposed comparing the angles of rows (or columns) to compute approximate dense structures in a matrix $A$. Let $C$ be the pattern matrix of $A$, which by definition has the same pattern as $A$ with all non-zero values equal to 1. The method proposed by Saad computes the upper triangular part of $CC^T$: entry $(i,j)$, for $j>i$, is the inner product (the cosine value) between row $i$ and row $j$ of $C$. A parameter $\tau$ is used to gauge the proximity of row patterns: if the cosine of the angle between rows $i$ and $j$ is not smaller than $\tau$, row $j$ is added to the group of row $i$. For $\tau=1$ the method computes perfectly dense blocks, while for $\tau<1$ it may compute larger blocks where some zero entries are padded into the pattern. To speed up the search, it may be convenient to run a first pass with the checksum algorithm to detect rows having an identical pattern and group them together; then, in a second pass, each non-assigned row is scanned again to determine whether it can be added to an existing group.
Two important performance measures to gauge the quality of the computed block ordering are the *average block density* (av_bd), defined as the number of non-zeros in the matrix divided by the number of elements in the non-zero blocks, and the *average block size* (av_bs), defined as the sum of the dimensions of the square diagonal blocks divided by the number of diagonal blocks. The cost of Saad's method is close to that of checksum-based methods when a good blocking already exists, and in most cases it remains below the cost of the least expensive block LU factorization, i.e. block ILU(0).
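The angle-based grouping can be sketched as below. This is a simplified dense version on a made-up pattern matrix; the real algorithm works on sparse rows and only scans the upper triangular part of $CC^T$. Squared cosines are compared to avoid square roots (every product here is an exact small integer).

```python
import numpy as np

def angle_block_grouping(C, tau):
    """Group rows whose patterns are close, after Saad's angle criterion.

    C is the 0/1 pattern matrix of A; row i joins the group of an earlier
    representative row r when cos(angle(C[i], C[r])) >= tau. With tau = 1
    only identical patterns are merged; tau < 1 pads zeros into blocks.
    """
    n = C.shape[0]
    group_of = -np.ones(n, dtype=int)
    reps = []                               # representative row per group
    for i in range(n):
        for g, r in enumerate(reps):
            dot = C[i] @ C[r]
            # compare squared cosines: dot^2 >= tau^2 * |C_i|^2 * |C_r|^2
            if dot * dot >= tau * tau * (C[i] @ C[i]) * (C[r] @ C[r]):
                group_of[i] = g
                break
        else:
            group_of[i] = len(reps)
            reps.append(i)
    return group_of

# Rows 0 and 1 share the same pattern; row 2 differs in one position.
C = np.array([[1, 1, 0, 1],
              [1, 1, 0, 1],
              [1, 1, 1, 1],
              [0, 0, 1, 0]], dtype=float)
exact = angle_block_grouping(C, tau=1.0)   # perfectly dense blocks only
loose = angle_block_grouping(C, tau=0.8)   # row 2 now joins the first group
```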

Our recently developed variable-block variant of the ARMS method (VBARMS) incorporates an angle-based compression technique during the factorization to detect fine-grained dense structures in the linear system automatically, without any user knowledge of the underlying problem, and exploits them to improve the overall robustness and throughput of the basic multilevel algorithm [50]. It is simplest to describe VBARMS from a graph point of view. Suppose we permute $A$ into block form as

$$\tilde{A} \approx P_B A P_B^T = \begin{pmatrix} \tilde{A}_{11} & \tilde{A}_{12} & \cdots & \tilde{A}_{1p} \\ \tilde{A}_{21} & \tilde{A}_{22} & \cdots & \tilde{A}_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{A}_{p1} & \tilde{A}_{p2} & \cdots & \tilde{A}_{pp} \end{pmatrix}, \tag{28}$$

where the diagonal blocks $\tilde{A}_{ii}$, $i=1,\ldots,p$, are $n_i\times n_i$ and the off-diagonal blocks $\tilde{A}_{ij}$ are $n_i\times n_j$. We use upper-case letters to denote matrix sub-blocks and lower-case letters for individual matrix entries. We may represent the adjacency graph of $\tilde{A}$ by the quotient graph of $A+A^T$ [36]. Calling $B$ the partition into blocks given by (28), we denote by $G/B=(V_B, E_B)$ the quotient graph obtained by coalescing the vertices assigned to the block $\tilde{A}_{ii}$ (for $i=1,\ldots,p$) into a supervertex $Y_i$. In other words, the entry in position $(i,j)$ of $\tilde{A}$ is a block of dimension $|Y_i|\times|Y_j|$, where $|X|$ is the cardinality of the set $X$. With this notation, the quotient graph $G/B=(V_B, E_B)$ is defined as

$$V_B = \{Y_1, \ldots, Y_p\}, \qquad E_B = \{(Y_i, Y_j) \mid \exists\, v \in Y_i,\ w \in Y_j \ \text{s.t.}\ (v,w) \in E\}. \tag{29}$$

An edge connects two supervertices $Y_i$ and $Y_j$ if there exists an edge from a vertex in $\tilde{A}_{ii}$ to a vertex in $\tilde{A}_{jj}$ in the graph $(V,E)$ of $A+A^T$.
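A minimal sketch of the quotient graph construction of Eq. (29) follows; the path graph and the blocks are a made-up example.

```python
def quotient_graph(adj, blocks):
    """Build the edge set E_B of the quotient graph G/B (Eq. 29).

    adj: adjacency lists of the (symmetrized) graph of A.
    blocks: list of vertex lists, one per supervertex Y_i.
    An edge joins two supervertices whenever any of their members
    are adjacent in the original graph.
    """
    of = {}                                  # vertex -> supervertex index
    for i, blk in enumerate(blocks):
        for v in blk:
            of[v] = i
    edges = set()
    for v, nbrs in enumerate(adj):
        for w in nbrs:
            if of[v] != of[w]:
                edges.add((min(of[v], of[w]), max(of[v], of[w])))
    return sorted(edges)

adj = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]   # path on 6 vertices
blocks = [[0, 1], [2, 3], [4, 5]]                  # three supervertices
E_B = quotient_graph(adj, blocks)
```

For the path graph, only the couplings 1-2 and 3-4 cross block boundaries, so the quotient graph is itself a path on the three supervertices.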

The complete pre-processing and factorization process of VBARMS consists of the following steps.

**Step 1.** Find the block ordering $P_B$ of $A$ such that, upon permutation, the matrix $P_B A P_B^T$ has fairly dense non-zero blocks. We use the angle-based graph compression algorithm proposed by Saad and described earlier to compute exact or approximate block structures in $A$.

**Step 2.** Scale the matrix obtained at Step 1 in the form $S_1 P_B A P_B^T S_2$, using two diagonal matrices $S_1$ and $S_2$, so that the largest entry of each row and column has magnitude smaller than or equal to 1.
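A possible realization of this scaling step is sketched below as a dense toy version with alternating row and column sweeps; the actual VBARMS implementation may differ, and the test matrix is made up.

```python
import numpy as np

def scale_rows_cols(A, sweeps=2):
    """Two-sided diagonal scaling: return S1, S2, B with B = S1 @ A @ S2
    and every entry of B of magnitude at most 1. Ending on a column sweep
    guarantees the bound: each column is divided by its own max."""
    n = A.shape[0]
    s1, s2 = np.ones(n), np.ones(n)
    B = A.astype(float).copy()
    for _ in range(sweeps):
        r = np.abs(B).max(axis=1)            # largest entry in each row
        r[r == 0] = 1.0
        B, s1 = B / r[:, None], s1 / r
        c = np.abs(B).max(axis=0)            # largest entry in each column
        c[c == 0] = 1.0
        B, s2 = B / c[None, :], s2 / c
    return np.diag(s1), np.diag(s2), B

A = np.array([[4.0, -2.0, 0.0],
              [1.0,  8.0, 3.0],
              [0.0, -1.0, 2.0]])
S1, S2, B = scale_rows_cols(A)
```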

**Step 3.** Find the block independent set ordering $P_I$ of the quotient graph $G/B=(V_B, E_B)$. Apply the permutation to the matrix obtained at Step 2 as

$$P_I S_1 P_B A P_B^T S_2 P_I^T = \begin{pmatrix} D & F \\ E & C \end{pmatrix}. \tag{30}$$

We use a simple form of weighted greedy algorithm to compute the ordering $P_I$. The algorithm is the same as the one used in ARMS, described in [25]. It consists of traversing the vertices of $G/B$ in the natural order $1, 2, \ldots, n$, marking each visited vertex $v$ and all of its nearest neighbors connected to $v$ by an edge, and adding $v$ and each visited node that is not already marked to the independent set. We assign the weight $\|Y\|_F$ to each supervertex $Y$.

In the $2\times 2$ partitioning (30), the upper left-most matrix $D$ is block diagonal, as in ARMS. However, due to the block permutation, the diagonal blocks of $D$ are themselves block sparse matrices, as opposed to simply sparse matrices in ARMS and in other forms of multilevel incomplete LU factorizations; see [51, 52]. For the same reason, the matrices $F$, $E$, $C$ are also block sparse.

**Step 4.** Factorize the matrix (30) in the form

$$\begin{pmatrix} D & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ EU^{-1} & I \end{pmatrix} \times \begin{pmatrix} U & L^{-1}F \\ 0 & A_1 \end{pmatrix}, \tag{31}$$

where $I$ is the identity matrix of appropriate size, and form the reduced system with the Schur complement

$$A_1 = C - ED^{-1}F. \tag{32}$$

The Schur complement is also block sparse and has the same block partitioning as $C$.

Steps 2–4 can be repeated on the reduced system a few times until the Schur complement is small enough. After one additional level, we obtain

$$P_I^{(1)} D_1^{(1)} A_1 D_2^{(1)} {P_I^{(1)}}^{T} = \begin{pmatrix} D_1 & F_1 \\ E_1 & C_1 \end{pmatrix} \tag{33}$$

that can be factored as

$$\begin{pmatrix} D_1 & F_1 \\ E_1 & C_1 \end{pmatrix} = \begin{pmatrix} L_1 & 0 \\ E_1 U_1^{-1} & I \end{pmatrix} \times \begin{pmatrix} U_1 & L_1^{-1} F_1 \\ 0 & A_2 \end{pmatrix}. \tag{34}$$

Denote by $A_\ell$ the reduced Schur complement matrix at level $\ell$, for $\ell>1$. After scaling and preordering $A_\ell$, a system with the matrix

$$P_I^{(\ell)} D_1^{(\ell)} A_\ell D_2^{(\ell)} {P_I^{(\ell)}}^{T} = \begin{pmatrix} D_\ell & F_\ell \\ E_\ell & C_\ell \end{pmatrix} = \begin{pmatrix} L_\ell & 0 \\ E_\ell U_\ell^{-1} & I \end{pmatrix} \times \begin{pmatrix} U_\ell & L_\ell^{-1} F_\ell \\ 0 & A_{\ell+1} \end{pmatrix} \tag{35}$$

needs to be solved, with

$$A_{\ell+1} = C_\ell - E_\ell D_\ell^{-1} F_\ell. \tag{36}$$

Calling

$$x_\ell = \begin{pmatrix} y_\ell \\ z_\ell \end{pmatrix}, \qquad b_\ell = \begin{pmatrix} f_\ell \\ g_\ell \end{pmatrix} \tag{37}$$

the unknown solution vector and the right-hand side vector of system (35), the solution process with the above multilevel VBARMS factorization consists of level-by-level forward elimination, followed by an exact solution of the last reduced system and a suitable inverse permutation. The solving phase is sketched in Algorithm 1.

In VBARMS, we perform the factorization approximately, for memory efficiency. We use block ILU factorization with threshold to invert inexactly both the upper left-most matrix $D_\ell \approx \bar{L}_\ell \bar{U}_\ell$ at each level $\ell$ and the last-level Schur complement matrix $A_{\ell_{\max}} \approx \bar{L}_S \bar{U}_S$. The block ILU method used in VBARMS is a straightforward block variant of the one-level pointwise ILUT algorithm. We drop small blocks $B \in \mathbb{R}^{m_B \times n_B}$ in $\bar{L}_\ell$, $\bar{U}_\ell$, $\bar{L}_S$, $\bar{U}_S$ whenever $\|B\|_F / (m_B \cdot n_B) < t$, for a given user-defined threshold $t$. The block pivots in block ILU are inverted exactly using GE with partial pivoting. In assembling the Schur complement matrix $A_{\ell+1}$ at level $\ell$, we take advantage of the finest block structure of $D_\ell$, $F_\ell$, $E_\ell$, $C_\ell$ imposed by the block ordering $P_B$, namely the small (usually dense) blocks in the diagonal blocks of $D_\ell$ and the corresponding small off-diagonal blocks in $E_\ell$ and $F_\ell$; we call optimized level-3 BLAS routines [53] to compute $A_{\ell+1}$ in Eq. (36). We do not drop entries in the Schur complement, except at the last level. The same threshold is applied in all these operations.
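The block dropping test can be sketched as follows (hypothetical helper name; the threshold value is made up for the example):

```python
import numpy as np

def keep_block(B, t):
    """Block dropping rule: retain B when ||B||_F / (m_B * n_B) >= t."""
    mB, nB = B.shape
    return np.linalg.norm(B, 'fro') / (mB * nB) >= t

dense = np.full((2, 2), 0.5)     # ||.||_F = 1.0, measure 1.0 / 4 = 0.25
tiny = np.full((2, 2), 1e-8)     # measure 5e-9, far below the threshold
kept = keep_block(dense, 0.1)
dropped = not keep_block(tiny, 0.1)
```

Normalizing by the block dimensions makes the test comparable across blocks of different sizes, so a single threshold $t$ can govern all of $\bar{L}_\ell$, $\bar{U}_\ell$, $\bar{L}_S$, $\bar{U}_S$.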

The VBARMS code is developed in the C language and is adapted from the existing ARMS code available in the ITSOL package [54]. The compressed sparse storage format of ARMS is modified to store block vectors and block matrices of variable size as collections of contiguous non-zero dense blocks (we refer to this data storage format as VBCSR). First, we compute the factors $\bar{L}_\ell$, $\bar{U}_\ell$ and $\bar{L}_\ell^{-1} F_\ell$ by performing a variant of the IKJ version of the Gaussian elimination algorithm, where index $I$ runs from 2 to $m_\ell$, index $K$ from 1 to $I-1$, and index $J$ from $K+1$ to $n_\ell$. This loop implicitly applies $\bar{L}_\ell^{-1}$ to the block row $[\,D_\ell \ \ F_\ell\,]$ to produce $[\,U_\ell \ \ \bar{L}_\ell^{-1} F_\ell\,]$. In the second loop, Gaussian elimination is performed on the block row $[\,E_\ell \ \ C_\ell\,]$ using the multipliers computed in the first loop, to give $E_\ell \bar{U}_\ell^{-1}$ and an approximation of the Schur complement $A_{\ell+1}$. Then, after Step 1, we permute explicitly the matrix at the first level, as well as the matrices involved in the factorization at each new reordering step. For extensive performance assessment results of the VBARMS method, we point the reader to [50].

**Algorithm 1** VBARMS_Solve($A_\ell$, $b_\ell$). The solving phase of the VBARMS method.

**Require:** $\ell \in \mathbb{N}^*$, $\ell_{\max} \in \mathbb{N}^*$, $b_\ell = (f_\ell \ \ g_\ell)^T$

1: Solve $L_\ell y = f_\ell$

2: Compute $g_\ell' = g_\ell - E_\ell U_\ell^{-1} y$

3: **if** $\ell = \ell_{\max}$ **then**

4: Solve $A_{\ell+1} z_\ell = g_\ell'$

5: **else**

6: Call VBARMS_Solve($A_{\ell+1}$, $g_\ell'$)

7: **end if**

8: Solve $U_\ell y_\ell = y - L_\ell^{-1} F_\ell z_\ell$
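The multilevel solve of Algorithm 1 can be sketched as follows. This is a dense, exact-factorization toy version with hypothetical helper names; the scaling, reordering and dropping steps of the real VBARMS code are omitted for clarity, so the recursion returns the exact solution.

```python
import numpy as np
from scipy.linalg import lu

def factor_levels(A, m_per_level):
    """Exact multilevel factorization sketch (Eq. 31 at each level):
    eliminate m unknowns per level and recurse on the Schur complement,
    which is finally returned as the last reduced system."""
    levels = []
    for m in m_per_level:
        D, F = A[:m, :m], A[:m, m:]
        E, C = A[m:, :m], A[m:, m:]
        L, U = lu(D, permute_l=True)         # D = L U (L is permuted lower)
        levels.append({
            "L": L, "U": U,
            "EUinv": E @ np.linalg.inv(U),   # the factor E U^{-1}
            "LinvF": np.linalg.solve(L, F),  # the factor L^{-1} F
        })
        A = C - E @ np.linalg.inv(D) @ F     # Schur complement A_{l+1}
    return levels, A

def vbarms_solve(levels, A_last, b, lev=0):
    """Level-by-level forward elimination, direct solve of the last
    reduced system, then back substitution (steps 1-8 of Algorithm 1)."""
    if lev == len(levels):
        return np.linalg.solve(A_last, b)    # last level: direct solve
    lv = levels[lev]
    m = lv["L"].shape[0]
    f, g = b[:m], b[m:]
    y = np.linalg.solve(lv["L"], f)                     # step 1
    g2 = g - lv["EUinv"] @ y                            # step 2
    z = vbarms_solve(levels, A_last, g2, lev + 1)       # steps 3-7
    yl = np.linalg.solve(lv["U"], y - lv["LinvF"] @ z)  # step 8
    return np.concatenate([yl, z])

rng = np.random.default_rng(2)
n = 12
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test
b = rng.standard_normal(n)
levels, A_last = factor_levels(A.copy(), [5, 4]) # two elimination levels
x = vbarms_solve(levels, A_last, b)
residual = np.linalg.norm(A @ x - b)
```

Replacing the exact factors with thresholded block ILU factors turns this exact solve into the preconditioner application used inside a Krylov iteration.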