In this chapter, we describe the implementations of the parallel
block-diagonal-bordered sparse matrix solvers that we have developed
for the Thinking Machines CM-5 distributed-memory multiprocessor.
All implementations have been developed with special concern for
sequential code optimization, in addition to optimization of the more
complex parallel code. Paramount to sequential code optimization is
matching the data structures to the algorithm in order to maximize
cache hits. This can be accomplished by performing operations,
wherever possible, on vectors of consecutively stored data. The
iterative Gauss-Seidel method makes this easy: each sweep is
essentially a matrix-vector product, with multiple vector-vector
(inner) products performed as new values of x_i are calculated for
each row i. The sparse implementation of Gauss-Seidel therefore
requires the calculation of a sparse-matrix times dense-vector
product. The figure below illustrates the optimal way to perform
this product, with the sparse matrix stored in row-major form. In
this figure, long horizontal lines depict sparse row vectors and the
vertical line represents the dense vector x.
Figure: Optimal Row-Major Data Storage for Gauss-Seidel Algorithms
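For concreteness, the following is a minimal C sketch of one such
sweep, assuming a compressed sparse row (CSR) layout with the
diagonal held in a separate array; the names (row_ptr, col_idx, val,
diag) are illustrative and are not the data structures used in this
chapter.

    #include <stddef.h>

    /* One Gauss-Seidel sweep over an n x n sparse matrix A in CSR
     * form: val[] holds the off-diagonal nonzeros row by row,
     * col_idx[] their column indices, and row_ptr[i]..row_ptr[i+1]-1
     * spans row i.  Because each row's nonzeros are consecutive in
     * memory, the inner product below walks the cache in unit
     * stride.  (Illustrative layout, not the chapter's own.) */
    void gauss_seidel_sweep(size_t n,
                            const size_t *row_ptr, const size_t *col_idx,
                            const double *val, const double *diag,
                            const double *b, double *x)
    {
        for (size_t i = 0; i < n; i++) {
            double sum = b[i];
            /* Sparse row i of A times the dense vector x. */
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum -= val[k] * x[col_idx[k]];
            x[i] = sum / diag[i];  /* new value is used by later rows */
        }
    }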
Algorithms for direct linear solvers have much more complicated access
patterns. For dense direct methods, storing the data in a
conventional two-dimensional matrix always leaves the data
inaccessible as a vector in one of the two directions, and unless the
software is written to optimize the strides through the matrix, cache
hit rates may suffer when performing dense factorization. The figure
below illustrates four possible dense factorization implementations,
two each for column-major storage and row-major storage [30]. In this
figure, long lines depict dense vectors and dots represent scalar
accesses.
Figure: Optimal Data Storage for Dense Factorization
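As a concrete instance of the stride issue, here is a minimal C
sketch of one of the four variants: a right-looking kij LU
factorization without pivoting on a row-major array, with the loops
ordered so that the innermost loop runs along rows in unit stride.
The routine is illustrative and is not taken from [30].

    #include <stddef.h>

    /* In-place LU factorization (Doolittle form, no pivoting) of an
     * n x n row-major matrix, with entry (i,j) at a[i*n + j].  The
     * kij loop ordering keeps the innermost loop running along rows,
     * so both a[i*n + j] and a[k*n + j] are accessed in unit stride;
     * a kji ordering on the same row-major storage would stride by n
     * and defeat the cache. */
    void lu_factor_row_major(size_t n, double *a)
    {
        for (size_t k = 0; k + 1 < n; k++) {
            for (size_t i = k + 1; i < n; i++) {
                double l = a[i * n + k] / a[k * n + k];
                a[i * n + k] = l;                  /* store multiplier in L */
                for (size_t j = k + 1; j < n; j++) /* unit-stride row update */
                    a[i * n + j] -= l * a[k * n + j];
            }
        }
    }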
For sparse direct linear solvers, we store the sparse matrices
implicitly, providing rich opportunities for improving sequential
cache performance. It is simple to store the data in separate
structures for the diagonal, lower-triangular, and upper-triangular
portions of the matrix, so that the factorization calculations reduce
to sparse-vector times sparse-vector products throughout. This is
illustrated in the figure below for a Doolittle factorization
algorithm with the lower-triangular data stored in row-major order
and the upper-triangular data stored in column-major order. Care must
be taken to use forward-reduction and backward-substitution
algorithms compatible with these performance-optimizing data
structures. This is straightforward, because the loops in the
triangular solutions can be nested in either order.
Figure: Optimal Data Storage for Sparse Factorization --- Doolittle's Algorithm
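To make the access patterns concrete, the following C sketch uses
dense indexing rather than the chapter's implicit sparse storage:
Doolittle's algorithm with the lower triangle stored row-major and
the upper triangle stored column-major, followed by
forward-reduction and backward-substitution loops nested to match
that storage. All names are illustrative; a sparse implementation
would store only the nonzeros of each row and column.

    #include <stddef.h>

    /* Doolittle factorization A = L*U with L stored row-major
     * (lo[i*n + m] holds L(i,m)) and U stored column-major
     * (up[j*n + m] holds U(m,j)).  Every inner sum is then a dot
     * product of two contiguously stored vectors: a row of L
     * against a column of U. */
    void doolittle(size_t n, const double *a, double *lo, double *up)
    {
        for (size_t k = 0; k < n; k++) {
            /* Row k of U: U(k,j) = A(k,j) - L(k,:).U(:,j), j >= k. */
            for (size_t j = k; j < n; j++) {
                double sum = a[k * n + j];
                for (size_t m = 0; m < k; m++)   /* unit stride, both operands */
                    sum -= lo[k * n + m] * up[j * n + m];
                up[j * n + k] = sum;
            }
            lo[k * n + k] = 1.0;                 /* unit diagonal of L */
            /* Column k of L: L(i,k) = (A(i,k) - L(i,:).U(:,k)) / U(k,k). */
            for (size_t i = k + 1; i < n; i++) {
                double sum = a[i * n + k];
                for (size_t m = 0; m < k; m++)
                    sum -= lo[i * n + m] * up[k * n + m];
                lo[i * n + k] = sum / up[k * n + k];
            }
        }
    }

    /* Triangular solves whose loop nesting matches the storage above:
     * the forward reduction walks L row by row (inner-product form)
     * and the backward substitution walks U column by column
     * (column-sweep form), so both traverse contiguous memory.
     * x holds b on entry and the solution on exit. */
    void forward_reduce(size_t n, const double *lo, double *x)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < i; j++)       /* contiguous row i of L */
                x[i] -= lo[i * n + j] * x[j];
    }

    void back_substitute(size_t n, const double *up, double *x)
    {
        for (size_t j = n; j-- > 0; ) {
            x[j] /= up[j * n + j];               /* U(j,j) */
            for (size_t i = 0; i < j; i++)       /* contiguous column j of U */
                x[i] -= up[j * n + i] * x[j];
        }
    }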
Now we examine methods to optimize parallel direct and parallel iterative solver implementations.