next up previous
Next: Ordering Up: Parallel LU Factorization of Previous: A Survey of

A New Three-step Preprocess Phase

The primary goal of this paper is to describe the significant benefits of using block-diagonal-bordered sparse matrices when performing sparse LU factorization in parallel on distributed-memory multi-processors. For parallel sparse block-diagonal-bordered matrix algorithms to be efficient when factoring irregular sparse matrices, the following three step preprocessing phase must be performed:

The first step determines the block-diagonal-bordered form; the second step determines the locations of fillin values for static data structures and also determines the number of calculations in independent blocks for the load balancing step; and the third step determines a mapping of data to processors for efficient implementation of the algorithm for the user's data. These three steps may be incorporated into an optimization framework that uses the three-step preprocessing phase to produce matrix orderings with optimal overall performance for a particular version of the block-diagonal-bordered sparse matrix factorization algorithm. Future research may examine the applicability of producing an automated tool that has the capabilities of finding both an optimal matrix ordering and an optimal version of the multi-processor algorithm that is driven by the characteristics of the sparse matrix. Different versions of the block-diagonal-bordered LU factorization algorithm are possible depending on the specific architecture and characteristics of the matrix. These algorithms will be discussed in more detail in section 4.

The metric for load balancing or the metric for an optimization routine to determine the most efficient overall ordering technique must be based on the actual workload required by the individual processors. This number may differ substantially from the number of equations assigned to processors because the number of calculations in an independent sub-block is a function of the number and location of non-zero matrix elements in that block --- not the number of equations in a block. Determining the actual workload may require a detailed simulation of all processing and interprocessor data communications.





next up previous
Next: Ordering Up: Parallel LU Factorization of Previous: A Survey of



David P. Koester
Sun Oct 22 16:27:33 EDT 1995