The most significant aspect of parallel sparse LU factorization is that the sparsity structure can be exploited to offer more parallelism than is available with dense matrix solvers. Parallelism in dense matrix factorization is achieved by distributing the data in such a manner that the calculations in one of the for loops can be performed in parallel. Sparse factorization algorithms have too few calculations in any single row or column for efficient parallelism of this kind; however, sparse matrices offer additional parallelism as a result of the nature of the data and the precedence rules governing the order of calculations. Instead of merely parallelizing a single for loop as in parallel dense matrix factorization, entire independent portions of a sparse matrix can be factored in parallel, especially when the sparse matrix has been ordered into block-diagonal-bordered form.
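To make the contrast concrete, the following is a minimal sketch (not the implementation described in this work): dense numpy blocks stand in for the sparse diagonal blocks, the helper name factor_diagonal_blocks is illustrative, and a thread pool supplies the concurrency.

```python
# A minimal sketch: the mutually independent diagonal blocks of a
# block-diagonal-bordered matrix can be factored concurrently because no
# precedence relation links them.  Dense numpy blocks stand in for the
# sparse blocks of a real solver (an assumption for illustration).
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy.linalg import lu_factor

def factor_diagonal_blocks(blocks):
    """LU-factor each independent diagonal block in parallel."""
    with ThreadPoolExecutor() as pool:
        # Each lu_factor call touches only its own block; this is the
        # extra parallelism exposed by block-diagonal-bordered form.
        return list(pool.map(lu_factor, blocks))

# Four independent, diagonally dominant blocks factored simultaneously.
rng = np.random.default_rng(0)
blocks = [rng.standard_normal((4, 4)) + 4.0 * np.eye(4) for _ in range(4)]
factors = factor_diagonal_blocks(blocks)
```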
Provided that a matrix can be ordered into block-diagonal-bordered form, the parallel sparse LU algorithm can reap additional benefits, such as the elimination of task graphs for distributed-memory multi-processor implementations. Minimizing or eliminating task graphs is significant because, for more conventional parallel sparse LU solvers, the task graph can contain as much information as the representation of the sparse matrix itself [18].
The same independence between diagonal blocks in these sparse matrices can also be exploited in our parallel sparse Gauss-Seidel algorithm. While Gauss-Seidel algorithms for dense matrices are inherently sequential, it is possible to identify portions of sparse matrices that have no mutual data dependencies, so calculations can proceed in parallel on mutually independent matrix partitions while maintaining the strict precedence rules of the Gauss-Seidel technique. All parallelism in the Gauss-Seidel algorithm is derived from the interconnection structure among elements in the matrix. Furthermore, the extremely sparse last diagonal block also has inherent parallelism that can be identified using graph multi-coloring techniques.
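As a hedged illustration only (dense numpy arrays and the helper names gs_update and parallel_gs_sweep are assumptions, not the algorithm as published), one sweep might update the mutually independent partitions concurrently, each preserving the strict row-by-row precedence internally, before sweeping the coupling equations:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def gs_update(A, b, x, rows):
    """Strictly sequential Gauss-Seidel updates over one set of rows."""
    for i in rows:
        s = A[i, :] @ x - A[i, i] * x[i]   # uses the newest available x
        x[i] = (b[i] - s) / A[i, i]

def parallel_gs_sweep(A, b, x, partitions, coupling_rows):
    # Mutually independent partitions write disjoint entries of x and
    # never reference one another, so they may be swept concurrently.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda rows: gs_update(A, b, x, rows), partitions))
    # The coupling equations (the last diagonal block) are swept after
    # the partitions; multi-coloring can parallelize this step as well.
    gs_update(A, b, x, coupling_rows)
```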
There are several distinct ways to illustrate the parallelism available in block-diagonal-bordered form matrices. One is through the graph of the matrix. The figure entitled "Graph with Four Independent Sub-Matrices" represents the form of a graph with four mutually independent sub-matrices (subgraphs) interconnected by shared coupling equations. No graph node in a subgraph has an interconnection to another subgraph except through the coupling equations. It should be intuitive that data in columns associated with nodes in the subgraphs can be factored independently up to the point where the coupling equations are factored; a sketch of how this independence can be detected follows this paragraph. The description of parallelism presented here for direct methods is also closely related to the concept of elimination graphs and super-nodes described in [29]. For the parallel Gauss-Seidel algorithm, a concept similar to elimination graphs could be used to depict the precedence in the calculations.
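A small sketch of how this independence can be detected (adjacency lists and the function name independent_subgraphs are illustrative assumptions): removing the nodes of the coupling equations from the matrix graph and collecting connected components yields the mutually independent subgraphs.

```python
from collections import deque

def independent_subgraphs(adjacency, coupling_nodes):
    """Connected components of the matrix graph minus the coupling nodes."""
    seen = set(coupling_nodes)
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:                      # breadth-first search from start
            node = queue.popleft()
            component.append(node)
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        components.append(component)
    return components
```

Each returned component corresponds to one diagonal block; the coupling nodes themselves become the border.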
A block-diagonal-bordered form sparse matrix can also be represented by an elimination or general precedence tree with supernodes at only two levels. Supernodes are collections of nodes in the elimination tree that are treated as a single entity; here the diagonal blocks form the leaves of the elimination tree, with a single additional supernode as the root of the tree.
By simply restructuring the graph presented in "Graph with Four Independent Sub-Matrices", it is possible to represent the same concept as a tree. An elimination tree for a block-diagonal-bordered form matrix with four supernodes as leaves and a single supernode as the tree's root is presented in the figure entitled "Elimination Tree with Four Supernode Leaves". Each leaf supernode has an inherent hierarchy of calculations that can be performed independently of the other leaf supernodes. For factorization algorithms, the root supernode offers little additional parallelism from independent calculations and would generally be represented as a vertical chain of nodes; this is a consequence of fill-in produced during factorization. Conversely, the last block for the block-diagonal-bordered parallel Gauss-Seidel method represents only the interconnection structure within the equations that couple the partitions in the block-diagonal portion of the matrix. A matrix-level sketch of this two-level elimination process follows the figures.
Figure: Graph with Four Independent Sub-Matrices
Figure: Elimination Tree with Four Supernode Leaves
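The following matrix-level sketch (dense blocks and scipy routines as stand-ins for a sparse distributed implementation; the block names are assumptions) shows the two-level elimination: each leaf supernode i owns a diagonal block A_i with border strips B_i and C_i, the leaves are eliminated independently, their Schur-complement updates are accumulated into the last diagonal block, and that block is factored last.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def factor_block_bordered(diag_blocks, right_borders, bottom_borders, last_block):
    """Two-level elimination: independent leaves first, then the root."""
    leaf_factors = []
    schur = np.array(last_block, dtype=float)
    # Each pass of this loop is independent of the others, so the leaf
    # eliminations could proceed in parallel across processors.
    for A_i, B_i, C_i in zip(diag_blocks, right_borders, bottom_borders):
        f = lu_factor(A_i)                 # eliminate one leaf supernode
        leaf_factors.append(f)
        schur -= C_i @ lu_solve(f, B_i)    # update the root from this leaf
    root_factor = lu_factor(schur)         # the root: sequential or pipelined
    return leaf_factors, root_factor
```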
For iterative methods, the reduced interconnectivity in the last diagonal block offers more available parallelism, but at the cost of requiring multiple iterations for a change in the value of any unknown to propagate throughout the interconnected values. Fill-in in a factored matrix increases the amount of calculation, but every interconnection theoretically possible from the matrix graph representation has been accounted for. Unfortunately, there is little parallelism inherent in factoring the last diagonal block, so pipelined techniques similar to those used for dense matrices are required for its parallel factorization. For the iterative case, the parallelism that multi-coloring exposes in the last block is sketched below.
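As an illustration (a greedy heuristic is assumed here; the multi-coloring technique actually employed may differ), coloring the graph of the last diagonal block so that no two adjacent nodes share a color allows all Gauss-Seidel updates within one color to proceed in parallel, color by color:

```python
def greedy_coloring(adjacency):
    """Assign each node the smallest color unused by its neighbors."""
    color = {}
    for node in adjacency:
        taken = {color[nbr] for nbr in adjacency[node] if nbr in color}
        c = 0
        while c in taken:                  # first color not taken nearby
            c += 1
        color[node] = c
    return color
```

Nodes of a common color share no edges, so their updates are mutually independent; the sparser the last block, the fewer colors are needed and the wider each parallel step becomes.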