The previous discussions have pointed out that the solution of the large nonsymmetric sparse linear system (15) at each pseudo-time step is a major computational task of the whole flow simulation, especially when the mean flow and the turbulence transport equations are solved in fully coupled form, the Jacobian is computed exactly by means of FD, and the size of the time-step is rapidly increased to recover Newton’s algorithm. For convenience, we write system (15) in compact form as $Ax = b$.
4.1. Multi-elimination ILU factorization preconditioner
Incomplete LU factorization methods (ILUs) are an effective, yet simple, class of preconditioning techniques for solving large linear systems. They write in the form $M = \bar{L}\bar{U}$, where $\bar{L}$ and $\bar{U}$ are approximations of the $L$ and $U$ factors of the standard triangular LU decomposition of $A$. The incomplete factorization may be computed directly from the Gaussian Elimination (GE) algorithm, by discarding some entries in the $L$ and $U$ factors according to various strategies. A stable ILU factorization is proved to exist for arbitrary choices of the sparsity pattern of $\bar{L}$ and $\bar{U}$ only for particular classes of matrices, such as M-matrices and H-matrices with positive diagonal entries. However, many techniques can help improve the quality of the preconditioner on more general problems, such as reordering, scaling, diagonal shifting, pivoting and condition estimators [25, 26, 27, 28]. As a result of these developments, in the past decade successful experiences have been reported with ILU preconditioners in areas that were once the exclusive domain of direct solution methods, like circuit simulation, power system networks, chemical engineering plant modeling, graphs and other problems not governed by PDEs, or in areas where direct methods have traditionally been preferred, such as structural analysis, semiconductor device modeling, and computational fluid dynamics (see [29, 30, 31, 32, 33]).
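To make the dropping idea concrete, the sketch below (ours, not taken from the codes discussed here) implements ILU(0), the simplest member of the family, which allows fill-in only at positions where $A$ itself is non-zero. Dense storage is used purely for readability; production codes work on sparse data structures.

```python
import numpy as np

def ilu0(A):
    """ILU(0): incomplete LU factorization that keeps fill-in only on the
    sparsity pattern of A. Returns a single matrix holding the strictly
    lower part of L (unit diagonal implicit) and the upper part of U."""
    n = A.shape[0]
    LU = A.astype(float).copy()
    pattern = A != 0                       # positions allowed to hold entries
    for i in range(1, n):
        for k in range(i):
            if not pattern[i, k]:
                continue
            LU[i, k] /= LU[k, k]           # multiplier l_ik
            for j in range(k + 1, n):
                if pattern[i, j]:          # update only within the pattern
                    LU[i, j] -= LU[i, k] * LU[k, j]
    return LU
```

When the pattern of $A$ is full, ILU(0) coincides with the exact LU factorization; on a genuinely sparse pattern it discards all fill-in outside the pattern.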
Multi-elimination ILU factorization is a powerful class of ILU preconditioners, which combines the simplicity of ILU techniques with the robustness and high degree of parallelism of domain decomposition methods. It is built on the observation that, due to sparsity, many unknowns of a linear system are not coupled by an equation (i.e. they are independent) and thus can be eliminated simultaneously at a given stage of GE. If the, say $m$, independent unknowns are numbered first, and the other $n - m$ unknowns last, the coefficient matrix of the system is permuted in a 2×2 block structure of the form

$$P A P^T = \begin{pmatrix} B & F \\ E & C \end{pmatrix}, \qquad (24)$$
where $B$ is a diagonal matrix of dimension $m$ and $C$ is a square matrix of dimension $n - m$. In multi-elimination methods, a reduced system is recursively constructed from (24) by computing a block LU factorization of $P A P^T$ of the form

$$P A P^T = \begin{pmatrix} L & 0 \\ E U^{-1} & I \end{pmatrix} \begin{pmatrix} U & L^{-1} F \\ 0 & S \end{pmatrix}, \qquad (25)$$
where $L$ and $U$ are the triangular factors of the LU factorization of $B$, $S = C - E B^{-1} F$ is the Schur complement with respect to $C$, and $I$ is the identity matrix of dimension $n - m$. The reduction process can be applied another time to the reduced system with $S$, and recursively to each consecutively reduced system until the Schur complement is small enough to be solved with a standard method such as a dense LAPACK solver. Multi-elimination ILU factorization preconditioners may be obtained from the decomposition (25) by performing the reduction process inexactly, dropping small entries in the Schur complement matrix and/or factorizing $B$ approximately at each reduction step. These preconditioners exhibit better parallelism than conventional ILU algorithms, due to the recursive factorization. Additionally, for comparable memory usage, they may be significantly more robust, especially for solving large problems, as the reduced system is typically small and better conditioned compared to the full system.
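A single exact reduction step is easy to express with the notation above. The sketch below (an illustration in numpy, with the diagonal $B$ stored as a vector) forms the Schur complement $S = C - E B^{-1} F$:

```python
import numpy as np

def reduce_once(B_diag, F, E, C):
    """One exact multi-elimination step: eliminate the independent
    unknowns (diagonal B) and return the Schur complement
    S = C - E B^{-1} F of the remaining unknowns."""
    Binv_F = F / B_diag[:, None]   # B^{-1} F is cheap since B is diagonal
    return C - E @ Binv_F
```

In a preconditioner the same step is performed inexactly, by dropping small entries of the computed Schur complement.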
The factorization (25) defines a general framework which may accommodate many different methods. An important distinction between various methods is rooted in the choice of the algorithm used to discover sets of independent unknowns. Many of these algorithms are borrowed from graph theory, where such sets are referred to as independent sets. Denoting by $G = (V, E)$ the adjacency graph of $A$, where $V$ is the set of vertices and $E$ the set of edges, a vertex independent set $S$ is defined as a subset of $V$ such that

$$x \in S,\ y \in S,\ x \neq y \ \Rightarrow\ (x, y) \notin E. \qquad (26)$$
The set $S$ is maximal if there is no other independent set strictly containing $S$. Independent sets in a graph may be computed by simple greedy algorithms which traverse the vertices in the natural order $1, 2, \ldots, n$, mark each visited vertex $v$ and all of its nearest neighbors connected to $v$ by an edge, and add each visited vertex that is not already marked to the independent set $S$. As an alternative to the greedy algorithm, the nested dissection ordering, mesh partitioning, or further information from the set of nested finite element grids of the underlying problem can be used [39, 40, 41].
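The greedy search described above can be sketched in a few lines (function and variable names are ours):

```python
def greedy_independent_set(n, edges):
    """Greedy maximal independent set: traverse vertices in natural order,
    add each unmarked vertex to S, then mark it and all its neighbours."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    marked = [False] * n
    S = []
    for v in range(n):
        if not marked[v]:
            S.append(v)            # v joins the independent set
            marked[v] = True
            for w in adj[v]:       # neighbours can no longer enter S
                marked[w] = True
    return S
```

On a path graph 0–1–2–3, for example, the algorithm returns the independent set {0, 2}.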
The multilevel preconditioner considered in our study is the Algebraic Recursive Multilevel Solvers (ARMS) method, which uses block independent sets computed by the simple greedy algorithm. Block independent sets are characterized by the property that unknowns of two different sets have no coupling, while unknowns within the same set may be coupled. In this case, the matrix $B$ appearing in (24) is block diagonal, and may typically consist of large diagonal blocks that are factorized by an ILU factorization with threshold (ILUT) for memory efficiency. In the ARMS implementation, first the incomplete triangular factors $L$, $U$ of $B$ are computed by one sweep of ILUT, and an approximation to $L^{-1} F$ is also computed. In a second loop, an approximation to $E U^{-1}$ and an approximate Schur complement matrix $\tilde{S}$ are derived. This holds at each reduction level. At the last level, another sweep of ILUT is applied to the (last) reduced system. The blocks $E U^{-1}$ and $L^{-1} F$ are stored temporarily, and then discarded from the data structure after the Schur complement matrix is computed. Only the incomplete factors of $B$ at each level, those of the last-level Schur matrix, and the permutation arrays are needed for the solving phase. With this implementation, dropping can be performed separately in the matrices $L$, $U$, $E U^{-1}$, $L^{-1} F$ and $\tilde{S}$. This in turn allows factoring $B$ accurately without incurring additional costs in $E U^{-1}$ and $L^{-1} F$, achieving high computational and memory efficiency. Implementation details and careful selection of the parameters are always critical aspects to consider in the design of sparse matrix algorithms. Next, we show how to combine the ARMS method with matrix compression techniques to exploit the block structure of $A$ for better efficiency.
4.2. The variable-block ARMS factorization
The discretization of the Navier-Stokes equations for turbulent compressible flows assigns five distinct variables to each grid point (density, scaled energy, two components of the scaled velocity, and turbulence transport variable); these reduce to four for incompressible, constant-density flows, and to three if additionally the flow is laminar. If the, say $\ell$, distinct variables associated with the same node are numbered consecutively, the permuted matrix has a sparse block structure with non-zero blocks of size $\ell \times \ell$. The blocks are usually fully dense, as variables at the same node are mutually coupled. Exploiting any available block structure in the preconditioner design may bring several benefits; some of them are explained below:
Memory. A clear advantage is to store the matrix as a collection of blocks using the variable-block compressed sparse row (VBCSR) format, saving column indices and pointers for the block entries.
Stability. On indefinite problems, computing with blocks instead of single elements enables a better control of pivot breakdowns, near singularities, and other possible sources of numerical instabilities. Block ILU solvers may be used instead of pointwise ILU methods.
Complexity. Grouping variables in clusters, the Schur complement is smaller and hopefully the last reduced system is better conditioned and easier to solve.
Efficiency. A full block implementation, based on higher-level optimized BLAS routines as computational kernels, may be designed, leading to better flop-to-memory ratios on modern cache-based computer architectures.
Cache effects. Better cache reuse is possible for block algorithms.
It has been demonstrated that block iterative methods often exhibit a faster convergence rate than their pointwise analogues for the solution of many classes of two- and three-dimensional partial differential equations (PDEs) [44, 45, 46]. For example, in the case of the simple Poisson equation with Dirichlet boundary conditions on a rectangle, discretized uniformly with $n_x$ points in the $x$ direction and $n_y$ points in the $y$ direction, it is often convenient to number the interior points by lines from the bottom up in the natural ordering, so that one obtains a block tridiagonal matrix with square blocks of size $n_x$; the diagonal blocks are tridiagonal matrices and the off-diagonal blocks are diagonal matrices. For large finite element discretizations, it is common to use substructuring, where each substructure of the physical mesh corresponds to one sparse block of the system. If the domain is highly irregular or the matrix does not correspond to a differential equation, finding the best block partitioning is much less obvious. In this case, graph reordering techniques are worth considering.
The PArameterized BLock Ordering (PABLO) method proposed by O’Neil and Szyld is one of the first block reordering algorithms for sparse matrices. The algorithm selects groups of nodes in the adjacency graph of the coefficient matrix such that the corresponding diagonal blocks are either full or very dense. It has been shown that classical block stationary iterative methods such as block Gauss-Seidel and SOR methods combined with the PABLO ordering require fewer operations than their point analogues for the finite element discretization of a Dirichlet problem on a graded L-shaped region, as well as on the 9-point discretization of the Laplacian operator on a square grid. The complexity of the PABLO algorithm is proportional to the number of nodes and edges in both time and space.
Another useful approach to compute dense blocks in the sparsity pattern of a matrix is the method proposed by Ashcraft. The algorithm searches for sets of rows or columns having exactly the same pattern. From a graph viewpoint, it looks for vertices of the adjacency graph of $A$ having the same adjacency list. These are also called indistinguishable nodes or cliques. The algorithm assigns a checksum quantity to each vertex, using a function of the form

$$\mathrm{chk}(u) = \sum_{(u, v) \in E} v, \qquad (27)$$
and then sorts the vertices by their checksums. This operation takes $O(n \log n)$ time. If $u$ and $v$ are indistinguishable, then $\mathrm{chk}(u) = \mathrm{chk}(v)$. Therefore, the algorithm examines nodes having the same checksum to see if they are indistinguishable. The ideal checksum function would assign a different value for each different row pattern that occurs, but it is not practical because it may quickly lead to huge numbers that may not even be machine-representable. Since the time cost of Ashcraft’s method is generally negligible relative to the time it takes to solve the system, simple checksum functions such as (27) are used in practice.
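The two-phase idea — hash by a cheap checksum, then confirm exact pattern equality only within each checksum bucket — can be sketched as follows (the checksum used here, the sum of column indices, is one simple choice):

```python
from collections import defaultdict

def group_identical_rows(patterns):
    """Group rows with identical sparsity patterns.
    patterns: one sorted tuple of column indices per row.
    A cheap checksum narrows the exact comparison to rows whose
    checksums collide."""
    by_checksum = defaultdict(list)
    for i, cols in enumerate(patterns):
        by_checksum[sum(cols)].append(i)    # checksum: sum of column indices
    groups = []
    for rows in by_checksum.values():
        buckets = defaultdict(list)
        for i in rows:                      # same checksum: verify equality
            buckets[tuple(patterns[i])].append(i)
        groups.extend(buckets.values())
    return groups
```

Note that rows 2 and 3 in the test below collide on the checksum (1 + 2 = 0 + 3) but are correctly kept in separate groups by the exact comparison.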
On the other hand, sparse unstructured matrices may sometimes exhibit approximate dense blocks consisting mostly of non-zero entries, except for a few zeros inside the blocks. By treating these few zeros as non-zero elements, at a small cost in memory, a block ordering may be generated for an iterative solver. Approximate dense blocks in a matrix may be computed by numbering consecutively rows and columns having a similar non-zero structure. However, this would require a new checksum function that preserves the proximity of patterns, in the sense that close patterns would result in close checksum values. Unfortunately, this property does not hold for Ashcraft’s algorithm in its original form. Saad proposed comparing the angles of rows (or columns) to compute approximate dense structures in a matrix. Let $C$ be the pattern matrix of $A$, which by definition has the same pattern as $A$ and whose non-zero values are all equal to 1. The method computes the upper triangular part of $C C^T$. Entry $(i, j)$ is the inner product between row $i$ and row $j$ of $C$, for $j > i$; normalized by the row norms, it is the cosine of the angle between the two row patterns. A parameter $\tau$ is used to gauge the proximity of row patterns. If the cosine of the angle between rows $i$ and $j$ is larger than $\tau$, row $j$ is added to the group of row $i$. For $\tau = 1$ the method will compute perfectly dense blocks, while for $\tau < 1$ it may compute larger blocks where some zero entries are padded in the pattern. To speed up the search, it may be convenient to run a first pass with the checksum algorithm to detect rows having an identical pattern, and group them together; then, in a second pass, each non-assigned row is scanned again to determine whether it can be added to an existing group.
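A possible sketch of the angle-based grouping follows (a simplified sequential scan rather than the exact implementation discussed above; the threshold name tau is ours; empty rows are assumed away):

```python
import numpy as np

def angle_group_rows(A, tau):
    """Group rows of A whose sparsity patterns have cosine >= tau.
    tau = 1.0 keeps only identical patterns; tau < 1.0 pads a few zeros."""
    C = (A != 0).astype(float)              # pattern matrix: non-zeros -> 1
    norms = np.sqrt((C * C).sum(axis=1))    # row norms (assumes no empty rows)
    group_of = [-1] * A.shape[0]
    groups = []
    for i in range(A.shape[0]):
        if group_of[i] >= 0:
            continue
        groups.append([i])
        group_of[i] = len(groups) - 1
        for j in range(i + 1, A.shape[0]):
            if group_of[j] < 0:
                cos = C[i] @ C[j] / (norms[i] * norms[j])
                if cos >= tau:              # patterns close enough to merge
                    groups[-1].append(j)
                    group_of[j] = group_of[i]
    return groups
```

With $\tau = 0.8$ the two nearly identical rows of the test matrix below are merged (one zero is padded); with $\tau = 1$ every row stays in its own group.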
Two important performance measures to gauge the quality of the computed block ordering are the average block density, defined as the number of non-zeros in the matrix divided by the number of elements in the non-zero blocks, and the average block size, which is the ratio between the sum of the dimensions of the square diagonal blocks and the number of diagonal blocks. The cost of Saad’s method is close to that of checksum-based methods for cases in which a good blocking already exists, and in most cases it remains lower than the cost of the least expensive block LU factorization, i.e. block ILU(0).
Our recently developed variable-block variant of the ARMS method (VBARMS) incorporates an angle-based compression technique during the factorization to detect fine-grained dense structures in the linear system automatically, without any user knowledge of the underlying problem, and exploits them to improve the overall robustness and throughput of the basic multilevel algorithm. It is simpler to describe VBARMS from a graph point of view. Suppose $A$ is permuted in block form as

$$P A P^T = \begin{pmatrix} A_{11} & \cdots & A_{1p} \\ \vdots & \ddots & \vdots \\ A_{p1} & \cdots & A_{pp} \end{pmatrix}, \qquad (28)$$
where the diagonal blocks $A_{II}$, $I = 1, \ldots, p$, are $n_I \times n_I$ and the off-diagonal blocks $A_{IJ}$ are $n_I \times n_J$. We use upper case letters to denote matrix sub-blocks and lower case letters for individual matrix entries. We may represent the adjacency graph of $A$ by a quotient graph. Calling $\mathcal{B}$ the partition into blocks given by (28), we denote by $G/\mathcal{B}$ the quotient graph obtained by coalescing the vertices assigned to block $I$ (for $I = 1, \ldots, p$) into a supervertex $Y_I$. In other words, the entry in position $(I, J)$ of $P A P^T$ is a block of dimension $n_I \times n_J$, where $n_I$ is the cardinality of the set $Y_I$. With this notation, the quotient graph is defined as

$$G/\mathcal{B} = \bigl( \{Y_1, \ldots, Y_p\},\ \{(Y_I, Y_J) : \exists\, x \in Y_I,\ y \in Y_J \text{ with } (x, y) \in E\} \bigr). \qquad (29)$$
That is, an edge connects two supervertices $Y_I$ and $Y_J$ if there exists an edge from a vertex in $Y_I$ to a vertex in $Y_J$ in the graph of $A$.
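Constructing the quotient graph from a block assignment is straightforward; a small sketch, with hypothetical names:

```python
def quotient_graph(edges, block_of):
    """Build the edge set of the quotient graph: supervertices are blocks,
    and an edge joins blocks I and J whenever some edge of the original
    graph joins a vertex of block I to a vertex of block J."""
    qedges = set()
    for u, v in edges:
        I, J = block_of[u], block_of[v]
        if I != J:                         # intra-block edges are coalesced
            qedges.add((min(I, J), max(I, J)))
    return qedges
```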
The complete pre-processing and factorization process of VBARMS consists of the following steps.
Step 1. Find the block ordering of $A$ such that, upon permutation, the matrix has fairly dense non-zero blocks. We use the angle-based graph compression algorithm proposed by Saad and described earlier to compute exact or approximate block structures in $A$.
Step 2. Scale the matrix obtained at Step 1 in the form $D_1 A D_2$, using two diagonal matrices $D_1$ and $D_2$, so that the largest entry in each row and column is smaller than or equal to one in magnitude.
Step 3. Find the block independent set ordering $P$ of the quotient graph $G/\mathcal{B}$. Apply the permutation to the matrix obtained at Step 2 as

$$P (D_1 A D_2) P^T = \begin{pmatrix} B & F \\ E & C \end{pmatrix}. \qquad (30)$$
We use a simple form of weighted greedy algorithm for computing the ordering $P$. The algorithm is the same as the one used in ARMS. It consists of traversing the vertices in the natural order $1, 2, \ldots, p$, marking each visited vertex $Y_I$ and all of its nearest neighbors connected to $Y_I$ by an edge, and adding each visited vertex that is not already marked to the independent set. A weight is assigned to each supervertex $Y_I$.
In the partitioning (30), the upper left-most matrix $B$ is block diagonal, as in ARMS. However, due to the block permutation, the diagonal blocks of $B$ are themselves block sparse matrices, as opposed to simply sparse matrices in ARMS and in other forms of multilevel incomplete LU factorization, see [51, 52]. The matrices $E$, $F$ and $C$ are also block sparse for the same reason.
Step 4. Factorize the matrix (30) in the form

$$\begin{pmatrix} B & F \\ E & C \end{pmatrix} = \begin{pmatrix} L & 0 \\ E U^{-1} & I \end{pmatrix} \begin{pmatrix} U & L^{-1} F \\ 0 & S \end{pmatrix}, \qquad (31)$$
where $I$ is the identity matrix of appropriate size, and form the reduced system with the Schur complement

$$S = C - E B^{-1} F. \qquad (32)$$
The Schur complement is also block sparse and has the same block partitioning as $C$.
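One reduction step of Step 4 with a block-diagonal $B$ can be sketched as follows; exact dense block solves stand in here for the block ILUT used in practice, and all names are illustrative:

```python
import numpy as np

def block_schur(B_blocks, F, E, C):
    """Schur complement S = C - E B^{-1} F for a block-diagonal B.
    Each diagonal block of B is processed independently, mirroring the
    independent eliminations enabled by the block independent set."""
    Binv_F = np.empty_like(F)
    row = 0
    for Bk in B_blocks:
        r = slice(row, row + Bk.shape[0])
        Binv_F[r] = np.linalg.solve(Bk, F[r])   # solve per independent block
        row += Bk.shape[0]
    return C - E @ Binv_F
```

Because each diagonal block is handled on its own, the loop over `B_blocks` is trivially parallel, which is the source of the parallelism claimed for multi-elimination methods.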
Steps 2–4 can be repeated on the reduced system a few times until the Schur complement is small enough. After one additional level, we obtain

$$P_1 (D_1^{(1)} S D_2^{(1)}) P_1^T = \begin{pmatrix} B_1 & F_1 \\ E_1 & C_1 \end{pmatrix}, \qquad (33)$$
that can be factored as

$$\begin{pmatrix} B_1 & F_1 \\ E_1 & C_1 \end{pmatrix} = \begin{pmatrix} L_1 & 0 \\ E_1 U_1^{-1} & I \end{pmatrix} \begin{pmatrix} U_1 & L_1^{-1} F_1 \\ 0 & S_1 \end{pmatrix}. \qquad (34)$$
Denote by $S_l$ the reduced Schur complement matrix at level $l$, for $l \geq 1$. After scaling and preordering $S_l$, a system with the matrix

$$P_l (D_1^{(l)} S_l D_2^{(l)}) P_l^T = \begin{pmatrix} B_l & F_l \\ E_l & C_l \end{pmatrix} \qquad (35)$$
needs to be solved. Denoting by $x_l$ the unknown solution vector and by $b_l$ the right-hand side vector of system (35), the solution process with the above multilevel VBARMS factorization consists of level-by-level forward elimination, followed by an exact solution of the last reduced system and a suitable inverse permutation. The solving phase is sketched in Algorithm 1.
In VBARMS, we perform the factorization approximately, for memory efficiency. We use block ILU factorization with threshold to invert inexactly both the upper left-most matrix $B_l$ at each level $l$ and the last-level Schur complement matrix. The block ILU method used in VBARMS is a straightforward block variant of the one-level pointwise ILUT algorithm. We drop small blocks in $L_l$, $U_l$, $E_l U_l^{-1}$ and $L_l^{-1} F_l$ whenever their norm is smaller than a given user-defined threshold $t$. The block pivots in block ILU are inverted exactly by using GE with partial pivoting. In assembling the Schur complement matrix at level $l$, we take advantage of the finest block structure of $B_l$, $F_l$, $E_l$, $C_l$, $S_l$ imposed by the block ordering on the small (usually dense) blocks in the diagonal blocks of $B_l$ and the corresponding small off-diagonal blocks in $E_l$ and $F_l$; we call optimized level-3 BLAS routines for computing the Schur complement in Eq. (36). We do not drop entries in the Schur complement, except at the last level. The same threshold is applied in all these operations.
The VBARMS code is developed in the C language and is adapted from the existing ARMS code available in the ITSOL package. The compressed sparse storage format of ARMS is modified to store block vectors and block matrices of variable size as a collection of contiguous non-zero dense blocks (we refer to this data storage format as VBCSR). First, we compute the factors $L$, $U$ and $L^{-1} F$ by performing a variant of the IKJ version of the Gaussian Elimination algorithm, where index $i$ runs from $2$ to $m$, index $k$ from $1$ to $i - 1$, and index $j$ from $k + 1$ to $n$. This loop implicitly applies GE to the block row $[B\ F]$ to produce $L$, $U$ and $L^{-1} F$. In the second loop, Gaussian Elimination is performed on the block row $[E\ C]$ using the multipliers computed in the first loop, to give $E U^{-1}$ and an approximation of the Schur complement $S$. Then, after Step 1, we permute explicitly the matrix at the first level as well as the matrices involved in the factorization at each new reordering step. For extensive performance assessment results of the VBARMS method, we point the reader to the original references.
Algorithm 1: VBARMS_Solve(). The solving phase with the VBARMS method.
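Assuming exact level factorizations (no dropping), the level-by-level solve of Algorithm 1 can be sketched recursively; all names below are illustrative, and permutations and scalings are omitted for brevity:

```python
import numpy as np

def vb_solve(levels, last_S, b):
    """Recursive multilevel solve. levels: one dict per reduction level with
    keys 'B', 'F', 'E'; last_S: last-level Schur complement, solved directly.
    Forward-eliminate level by level, solve the last reduced system,
    then back-substitute on the way out of the recursion."""
    if not levels:
        return np.linalg.solve(last_S, b)          # last reduced system
    lvl, rest = levels[0], levels[1:]
    m = lvl['B'].shape[0]
    b1, b2 = b[:m], b[m:]
    y1 = np.linalg.solve(lvl['B'], b1)             # forward elimination
    z = vb_solve(rest, last_S, b2 - lvl['E'] @ y1) # recurse on reduced system
    x1 = np.linalg.solve(lvl['B'], b1 - lvl['F'] @ z)  # back substitution
    return np.concatenate([x1, z])
```

With exact factors the sketch reproduces the direct solution; in VBARMS the block solves with $B_l$ are replaced by triangular solves with the incomplete block ILUT factors.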