
Forward Reduction and Backward Substitution

The remaining steps in the parallel algorithm are forward reduction and backward substitution. The parallel versions of these algorithms take advantage of the fact that the calculations can be performed in either of two distinct orders that preserve the precedence relations in the calculations [34]. These two orderings are combined to minimize communication time when solving for the last diagonal block. The forward reduction algorithm that operates with the parallel block-diagonal-bordered LU factorization algorithm can be broken into three component parts, similar to the LU factorization algorithm. Pseudo-code representations of each parallel algorithm section are presented separately in figures gif through gif in appendix gif. In particular, these figures correspond to the following steps, which are also sketched in sequential form after the list:

  1. forward reduce the diagonal blocks and border --- figure gif,
  2. update the last diagonal block --- figure gif,
  3. forward reduce the last diagonal block --- figure gif.
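To make the data dependencies among these three steps concrete, the following is a minimal sequential sketch in C. It assumes dense storage, a unit lower triangular factor from LU factorization, and hypothetical names (Lkk, Lnk, Lnn, bdb_forward_reduce); the algorithms in the appendix operate on sparse, distributed data structures and interleave these steps with interprocessor communications.

    /* Minimal sequential sketch of block-diagonal-bordered forward reduction,
     * solving L*y = b.  Dense, row-major storage and a unit-diagonal L are
     * assumed here for clarity only.
     *   Lkk[k]  - factored diagonal block k           (nb[k] x nb[k])
     *   Lnk[k]  - lower-border rows coupling block k  (nlast x nb[k])
     *   Lnn     - factored last diagonal block        (nlast x nlast)
     *   bk, blast - right-hand side, overwritten with y
     */
    void bdb_forward_reduce(int nblocks, const int *nb, int nlast,
                            double **Lkk, double **Lnk, double *Lnn,
                            double **bk, double *blast)
    {
        /* 1. forward reduce each independent diagonal block */
        for (int k = 0; k < nblocks; k++) {
            int n = nb[k];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < i; j++)
                    bk[k][i] -= Lkk[k][i * n + j] * bk[k][j];
        }

        /* 2. update the last block's right-hand side with the border terms */
        for (int k = 0; k < nblocks; k++) {
            int n = nb[k];
            for (int i = 0; i < nlast; i++)
                for (int j = 0; j < n; j++)
                    blast[i] -= Lnk[k][i * n + j] * bk[k][j];
        }

        /* 3. forward reduce the last diagonal block */
        for (int i = 0; i < nlast; i++)
            for (int j = 0; j < i; j++)
                blast[i] -= Lnn[i * nlast + j] * blast[j];
    }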

The backward substitution algorithm that operates with the parallel block-diagonal-bordered LU factorization algorithm can be broken into two component parts: back substitute the last diagonal block, then back substitute the remaining upper triangular matrix. The only interprocessor communications required occur when solving for x in the last diagonal block. The solution for x in this portion of the matrix broadcasts those values of x to all processors, so they are available for the next step, solving for the remaining values of x in the diagonal blocks. Pseudo-code representations of each parallel algorithm section are presented separately in figures gif and gif in appendix gif, for backward substitution of the last diagonal block and backward substitution of the diagonal blocks and border, respectively.
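A corresponding sequential sketch of the two backward substitution steps is given below, again with dense, row-major storage and hypothetical names (Ukk, Ukn, Unn, bdb_backward_substitute). The broadcast of the last block's x values noted above is what separates the two steps in the parallel algorithm.

    /* Minimal sequential sketch of block-diagonal-bordered backward
     * substitution, solving U*x = y.
     *   Unn     - factored last diagonal block            (nlast x nlast)
     *   Ukn[k]  - upper-border columns coupling block k   (nb[k] x nlast)
     *   Ukk[k]  - factored diagonal block k               (nb[k] x nb[k])
     *   yk, ylast - right-hand side from forward reduction, overwritten with x
     */
    void bdb_backward_substitute(int nblocks, const int *nb, int nlast,
                                 double **Ukk, double **Ukn, double *Unn,
                                 double **yk, double *ylast)
    {
        /* 1. back substitute the last diagonal block; these are the x values
         *    that the parallel algorithm broadcasts to all processors        */
        for (int i = nlast - 1; i >= 0; i--) {
            for (int j = i + 1; j < nlast; j++)
                ylast[i] -= Unn[i * nlast + j] * ylast[j];
            ylast[i] /= Unn[i * nlast + i];
        }

        /* 2. back substitute each remaining diagonal block independently,
         *    first applying the border columns' contribution from x_last    */
        for (int k = 0; k < nblocks; k++) {
            int n = nb[k];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < nlast; j++)
                    yk[k][i] -= Ukn[k][i * nlast + j] * ylast[j];
            for (int i = n - 1; i >= 0; i--) {
                for (int j = i + 1; j < n; j++)
                    yk[k][i] -= Ukk[k][i * n + j] * yk[k][j];
                yk[k][i] /= Ukk[k][i * n + i];
            }
        }
    }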

Forward reduction and backward substitution algorithms for Choleski factorization are similar to those for LU factorization, with one major difference: the factorization process generates only a single lower triangular matrix, L. For the last diagonal block, one triangular solution step must therefore occur in a manner that requires more communications than an optimally implemented triangular solution for LU factorization. The kji order for LU factorization in the last diagonal block has been selected to maximize performance for both forward reduction and backward substitution by minimizing communications overhead. For Choleski factorization, however, the optimal direct solver algorithm must use a column distribution for the data in the last block, incur additional communications in the forward reduction of the last diagonal block, and then backward substitute the last diagonal block using an implicit transpose of L. The final step ensures that all x values are broadcast to all processors, eliminating an extra, costly communications step. This combination of data distribution and algorithm specifics ensures that the fewest calculations and the minimum amount of communications are performed, and should offer the best opportunity for good parallel performance.
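The implicit transpose can be illustrated with a small sequential sketch: solving the transposed system row by row reads exactly one column of L per unknown, so a column-oriented (column-distributed) L supplies the needed entries without ever forming the transpose explicitly. The dense, column-major layout and the function name below are assumptions for illustration only.

    /* Minimal sketch of back substitution using the single Choleski factor L
     * in place of U: it solves the transposed lower triangular system without
     * forming the transpose, by reading column i of L wherever row i of the
     * transposed matrix would be needed.  L is dense and column-major here.
     */
    void choleski_back_substitute_implicit_transpose(int n,
                                                     const double *L, /* column-major */
                                                     double *y)       /* overwritten with x */
    {
        for (int i = n - 1; i >= 0; i--) {
            /* the entries needed for unknown i, L[j][i] with j > i,
             * all lie in column i of L                               */
            for (int j = i + 1; j < n; j++)
                y[i] -= L[i * n + j] * y[j];   /* L[j][i] in column-major layout */
            y[i] /= L[i * n + i];              /* diagonal entry L[i][i] */
        }
    }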





