The parallel block-diagonal-bordered sparse matrix solver has been implemented in the C programming language for the Thinking Machines CM-5 multi-processor using message passing and a host-node paradigm. The same implicit hierarchical data structure used in the sequential implementation is used in the parallel implementation, although the data structure is partitioned and the data are physically allocated to the distributed memory located with the various processors. This parallel implementation has been developed as an instrumented proof-of-concept to examine the efficiency of the data structures and the overhead of message passing when partial sums are sent to the processor holding the last block to update values located there. This implementation is a simplified version of a block-diagonal-bordered sparse matrix solver: in this initial parallel implementation, the last block is factored sequentially on the host processor. The actual code that implements the factorization algorithm presented in figure 16 contains extra node timing calls and communications to send this timing data back to the host processor for display. In spite of the extra communications overhead due to the instrumentation and the significant sequential portion of this proof-of-concept code, empirical performance data illustrate good speedup potential. Performance of this algorithm is discussed in the next section.
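To make the partitioned storage concrete, a minimal C sketch of a block-diagonal-bordered structure is given below. All type and field names here are illustrative assumptions; they are not the identifiers used in the actual implementation, which stores the matrix in the implicit hierarchical form described above.

/* Illustrative sketch only: type and field names are assumed, not
 * those of the actual implementation.                               */
typedef struct {
    int    row, col;        /* global indices of a non-zero element  */
    double value;           /* numeric value of the element          */
} nonzero_t;

typedef struct {
    int        order;       /* order of this diagonal block          */
    nonzero_t *diag;        /* non-zeros of the diagonal block       */
    int        nnz_diag;
    nonzero_t *border;      /* non-zeros of the associated border    */
    int        nnz_border;
} block_t;

/* Each node processor allocates only its own blocks in local memory;
 * in this version, the last (bordered) block resides on the host.   */
typedef struct {
    int      n_blocks;      /* diagonal blocks assigned to this node */
    block_t *blocks;
} node_partition_t;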
By limiting the factorization of the last block to a sequential process on the host processor, all partial sums of updates for values in the borders in the second phase of the factorization step can be sent to the host in a single message. A vector of structures identifying all non-zero elements in the last block is used to determine which partial sum values need to be calculated. To minimize communications times in this factorization step, a single vector of all non-zero partial sums is constructed at each processor. In this implementation, no attempt has been made to perform a parallel reduction of the partial sums of updates from the borders.
Figure 16: Parallel Sparse LU Factorization
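The single-message update scheme can be illustrated with a short sketch. The fragment below is illustrative only: send_block_to_host() is a hypothetical stand-in for the actual CM-5 message-passing send call, and the partial_sum_t layout is assumed, not taken from the implementation.

/* Hypothetical stand-in for the CM-5 message-passing send call. */
extern void send_block_to_host(void *buf, int nbytes);

/* One entry per non-zero element of the last block: row and col give
 * the position of the element, sum the locally accumulated update.  */
typedef struct {
    int    row, col;
    double sum;
} partial_sum_t;

/*
 * Each node processor packs every locally computed non-zero partial
 * sum into one contiguous vector, then ships the whole vector to the
 * host with a single message instead of many small ones.
 */
void flush_partial_sums(partial_sum_t *vec, int count)
{
    if (count > 0)
        send_block_to_host(vec, count * (int)sizeof *vec);
}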
The remaining steps in the parallel algorithm are forward reduction and backward substitution. The parallel versions of these algorithms combine row reduction/substitution and sub-matrix reduction/substitution; this combination minimizes communications times by generating long vectors for each message. Later implementations of block-diagonal-bordered matrix algorithms will include the use of active messages for fast transfers of individual partial sums or for fast broadcasts of values to other processors. Pseudo-code versions of the algorithms for parallel forward reduction and parallel backward substitution are presented in figures 17 and 18, respectively. As in the sequential algorithm, the parallel backward substitution step does not require a final division operation, because all diagonal values in the upper triangular factor are equal to 1. Detailed versions of the factorization, forward reduction, and backward substitution algorithms are presented in appendix B.
Figure 17: Parallel Sparse Forward Reduction
Figure 18: Parallel Sparse Backward Substitution
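As a concrete illustration of why no final division is needed, the following dense sketch of backward substitution assumes a row-major n-by-n upper triangular factor with a unit diagonal, as in a Crout-style LU factorization; the names and the dense layout are assumptions for illustration, since the actual implementation operates on the sparse hierarchical data structure rather than a dense array.

/* Dense sketch of backward substitution with a unit-diagonal upper
 * triangular factor u (row-major n x n, u[i*n + i] == 1 for all i). */
void backward_substitute(int n, const double *u, const double *y, double *x)
{
    int i, j;
    for (i = n - 1; i >= 0; i--) {
        double s = y[i];
        for (j = i + 1; j < n; j++)
            s -= u[i * n + j] * x[j];  /* subtract already-solved terms */
        x[i] = s;                      /* no division: u[i*n + i] == 1  */
    }
}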