
The Parallel Algorithm

The parallel block-diagonal-bordered sparse matrix solver has been implemented in the C programming language for the Thinking Machines CM-5 multiprocessor using message passing and a host-node paradigm. The parallel implementation uses the same implicit hierarchical data structure as the sequential implementation, although the data structure is partitioned and the data are distributed across the local memories of the individual processors. This parallel implementation has been developed as an instrumented proof-of-concept to examine the efficiency of the data structures and the overhead of message passing when partial sums are sent to the last block to update values located there. It is a simplified version of a block-diagonal-bordered sparse matrix solver: in this initial parallel implementation, the last block is factored sequentially on the host processor. The actual code implementing the factorization algorithm presented in figure 16 contains extra node timing calls and additional communications to send the timing data back to the host processor for display. In spite of the extra communications overhead due to this instrumentation and the significant sequential portion of the proof-of-concept code, empirical performance data illustrate good speedup potential. Performance of this algorithm is discussed in the next section.
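As a rough illustration of this host-node division of labor, the following C sketch outlines the node-side control flow. Every function and type name here is a hypothetical placeholder; the source does not list the actual routines or the specific CM-5 message-passing calls:

    #include <stddef.h>

    typedef struct SparseBlock SparseBlock;  /* opaque local block data */

    /* Placeholder prototypes -- illustrative assumptions only. */
    void    lu_factor_block(SparseBlock *diag);
    double *compute_border_partial_sums(SparseBlock *diag,
                                        SparseBlock *border);
    size_t  num_border_nonzeros(const SparseBlock *border);
    void    send_to_host(const double *vec, size_t len);

    void node_factor_step(SparseBlock *diag, SparseBlock *border)
    {
        /* Phase 1: factor the locally held diagonal block; the
         * diagonal blocks are mutually independent, so this phase
         * requires no communication. */
        lu_factor_block(diag);

        /* Phase 2: compute partial sums of updates to values in
         * the last block from the local border strips. */
        double *partials = compute_border_partial_sums(diag, border);

        /* All partial sums travel to the host in a single message;
         * the host then factors the last block sequentially. */
        send_to_host(partials, num_border_nonzeros(border));
    }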

Because the factorization of the last block is limited to a sequential process on the host processor, all partial sums of updates computed from the borders in the second phase of the factorization step can be sent to the host processor in a single message. A vector of structures identifying all non-zero elements in the last block is used to determine which partial sum values need to be calculated. To minimize communication time in this factorization step, a single vector of all non-zero partial sums is constructed at each processor. In this implementation, there has been no attempt at a parallel reduction of the partial sums of updates from the borders.
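A minimal sketch of this packing step is given below. Dense row-major border strips are assumed purely for clarity; the actual implementation stores the borders in the implicit hierarchical sparse data structure, and the NzEntry structure and all other names are illustrative assumptions rather than the original code:

    #include <stddef.h>

    /* One entry of the "vector of structures" identifying a non-zero
     * position (row, col) in the last block.  Name is hypothetical. */
    typedef struct {
        int row, col;
    } NzEntry;

    /* Pack the partial sums for every non-zero last-block entry into
     * one dense vector, so that a single message to the host suffices. */
    size_t pack_partial_sums(const NzEntry *nz, size_t n_nz,
                             const double *border_lo, /* n_last x strip_len */
                             const double *border_up, /* strip_len x n_last */
                             int strip_len, int n_last,
                             double *out)
    {
        size_t k = 0;
        for (size_t e = 0; e < n_nz; e++) {
            double sum = 0.0;
            /* Inner product of the lower-border row and the
             * upper-border column that meet at (row, col). */
            for (int t = 0; t < strip_len; t++)
                sum += border_lo[nz[e].row * strip_len + t]
                     * border_up[t * n_last + nz[e].col];
            out[k++] = sum;   /* one slot per non-zero, in NzEntry order */
        }
        return k;             /* length of the single outgoing message */
    }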

 
Figure 16: Parallel Sparse LU Factorization 

The remaining steps in the parallel algorithm are forward reduction and backward substitution. The parallel versions of these algorithms combine row reduction/substitution and sub-matrix reduction/substitution; these techniques are combined to minimize communication time by generating long vectors. Later implementations of block-diagonal-bordered matrix algorithms will include the use of active messages for fast transfers of individual partial sums or for fast broadcasts of values to other processors. Pseudo-code versions of the algorithms for parallel forward reduction and parallel backward substitution are presented in figures 17 and 18, respectively. As in the sequential algorithm, the parallel backward substitution step does not require a final division operation, because all diagonal values of U are equal to 1. Detailed versions of the factorization, forward reduction, and backward substitution algorithms are presented in appendix B.
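The unit-diagonal shortcut noted above can be illustrated with a brief C sketch. A dense row-major matrix is assumed purely for readability; the actual solver operates on the sparse block-diagonal-bordered structure with the communication pattern shown in figure 18:

    /* Backward substitution for U x = y where every diagonal element
     * u_ii equals 1: no final division is needed. */
    void back_substitute_unit(const double *U, double *x, int n)
    {
        for (int i = n - 1; i >= 0; i--) {
            double s = x[i];                  /* x holds y on entry */
            for (int j = i + 1; j < n; j++)
                s -= U[i * n + j] * x[j];     /* subtract known terms */
            x[i] = s;                         /* u_ii == 1: no divide */
        }
    }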

 
Figure 17: Parallel Sparse Forward Reduction 

 
Figure 18: Parallel Sparse Backward Substitution 


