Next: Ordering
Up: Parallel LU Factorization of
Previous: A Survey of
The primary goal of this paper is to describe the significant benefits
of using block-diagonal-bordered sparse matrices when performing
sparse LU factorization in parallel on distributed-memory
multi-processors. For parallel sparse block-diagonal-bordered
matrix algorithms to be efficient when factoring irregular sparse
matrices, the following three-step preprocessing phase must be
performed:
- order into block-diagonal-bordered form,
- pseudo factor to identify both fillin and the number of calculations for independent blocks, and
- load balance to uniformly distribute the calculations among processors.
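The first step can be illustrated with a small permutation sketch. The example below is hypothetical (a 6-node system and a hand-chosen partition, not the paper's ordering algorithm): two independent groups of equations are coupled only through two "border" nodes, and reordering the matrix so the independent blocks come first and the border nodes last exposes the block-diagonal-bordered form.

```python
import numpy as np

# Hypothetical 6-node system: groups {0, 2} and {1, 5} are coupled
# only through border nodes {3, 4}, but the natural ordering
# interleaves them.
n = 6
A = np.zeros((n, n))
edges = [(0, 2), (0, 3), (2, 3), (1, 5), (1, 4), (5, 4), (3, 4)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
np.fill_diagonal(A, 4.0)

# Ordering step: independent blocks first, border nodes last.
perm = [0, 2, 1, 5, 3, 4]
P = A[np.ix_(perm, perm)]

# The permuted matrix is block-diagonal-bordered: the two 2x2
# diagonal blocks do not couple to each other directly ...
assert not P[0:2, 2:4].any()
# ... and each block couples only to the 2x2 border at the end.
assert P[0:2, 4:6].any() and P[2:4, 4:6].any()
```

In practice the border (separator) nodes would be found by a graph-partitioning heuristic rather than given by hand, as here.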
The first step determines the block-diagonal-bordered form; the second
step determines the locations of fillin values for static data
structures and also determines the number of calculations in
independent blocks for the load balancing step; and the third step
determines a mapping of data to processors for efficient
implementation of the algorithm for the user's data. These three
steps may be incorporated into an optimization framework that searches
for matrix orderings with optimal overall performance for a particular
version of the block-diagonal-bordered sparse matrix factorization
algorithm. Future research may examine the feasibility of an automated
tool that finds both an optimal matrix ordering and an optimal version
of the multi-processor algorithm, driven by the characteristics of the
sparse matrix. Different versions of the block-diagonal-bordered LU
factorization algorithm are possible, depending on the specific
architecture and the characteristics of the matrix. These algorithms
will be discussed in more detail in section 4.
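The role of the second preprocessing step can be sketched with a small symbolic factorization (the 5x5 "arrowhead" patterns below are hypothetical examples, not data from the paper): elimination is carried out on the nonzero pattern alone, recording where fillin will appear so that static data structures can be sized in advance, and the two orderings of the same graph show how strongly the ordering step controls fillin.

```python
import numpy as np

def symbolic_factor(pattern):
    """Return the set of (i, j) fillin positions created when the
    boolean nonzero pattern is eliminated in natural order."""
    S = pattern.copy()
    n = S.shape[0]
    fillin = set()
    for k in range(n):
        below = [i for i in range(k + 1, n) if S[i, k]]
        right = [j for j in range(k + 1, n) if S[k, j]]
        for i in below:
            for j in right:
                if not S[i, j]:
                    fillin.add((i, j))  # record for static storage
                    S[i, j] = True
    return fillin

n = 5
# Arrowhead with the dense row/column first: eliminating node 0
# connects all remaining nodes, so the pattern fills completely.
P = np.eye(n, dtype=bool)
P[0, :] = P[:, 0] = True
assert len(symbolic_factor(P)) == 12

# The same graph with the dense row/column last: no fillin at all.
Q = np.eye(n, dtype=bool)
Q[n - 1, :] = Q[:, n - 1] = True
assert symbolic_factor(Q) == set()
```

The same symbolic pass that locates fillin also yields the operation counts per independent block that the load-balancing step consumes.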
The metric used for load balancing, or by an optimization routine to
select the most efficient overall ordering technique, must be based on
the actual workload required of the individual processors. This
workload may differ substantially from the number of equations
assigned to each processor, because the number of calculations in an
independent sub-block is a function of the number and location of
non-zero matrix elements in that block, not of the number of equations
in the block. Determining the actual workload may require a detailed
simulation of all processing and interprocessor data communications.
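The gap between equation counts and operation counts can be sketched as follows (the block patterns and helper names are hypothetical, and the greedy heuristic stands in for whatever load-balancing scheme is actually used): operations per block are counted by symbolic elimination of the block's nonzero pattern, and blocks are then assigned, largest first, to the least-loaded processor.

```python
import numpy as np

def block_flops(pattern):
    """Count LU update operations for a boolean nonzero pattern,
    tracking fillin as elimination proceeds."""
    S = pattern.copy()
    n = S.shape[0]
    flops = 0
    for k in range(n):
        rows = [i for i in range(k + 1, n) if S[i, k]]
        cols = [j for j in range(k + 1, n) if S[k, j]]
        flops += len(rows)          # computing the multipliers
        for i in rows:
            for j in cols:
                flops += 1          # one multiply-add per update
                S[i, j] = True      # fillin becomes a real entry
    return flops

def greedy_balance(workloads, nprocs):
    """Assign blocks (largest first) to the least-loaded processor."""
    loads = [0] * nprocs
    assign = {}
    for b in sorted(range(len(workloads)), key=lambda b: -workloads[b]):
        p = loads.index(min(loads))
        assign[b] = p
        loads[p] += workloads[b]
    return assign, loads

# Two blocks with the SAME number of equations but different sparsity,
# plus a smaller dense block: the workloads, not the dimensions,
# drive the assignment.
rng = np.random.default_rng(0)
dense5 = np.ones((5, 5), dtype=bool)
sparse5 = np.eye(5, dtype=bool) | (rng.random((5, 5)) < 0.2)
dense3 = np.ones((3, 3), dtype=bool)
work = [block_flops(b) for b in (dense5, sparse5, dense3)]
assign, loads = greedy_balance(work, 2)
```

This static count ignores communication costs, which is why the text notes that a detailed simulation of processing and interprocessor communication may still be required.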
David P. Koester
Sun Oct 22 16:27:33 EDT 1995