
Forward Reduction and Backward Substitution Algorithms

The remaining steps in the parallel algorithm are forward reduction and backward substitution. The parallel versions of these algorithms take advantage of the fact that the calculations can be performed in either of two distinct orders that preserve the precedence relations in the calculations [23]. These two orderings are combined to minimize communications time when solving for the last diagonal block. The forward reduction algorithm that operates with the parallel block-diagonal-bordered LU factorization algorithm can be broken into three component parts, similar to LU factorization. Pseudo-code representations of each section of the parallel algorithm are presented separately in figures 20 through 23:

  1. forward reduce the diagonal blocks and border --- figure 20,
  2. update the last diagonal block --- figures 21 and 22,
  3. forward reduce the last diagonal block --- figure 23.
As with the LU factorization algorithm, the actual parallel implementation required a processor synchronization before starting each of these program sections. To better understand the algorithm, timing statistics were collected for each portion of the algorithm; they are presented in section 8.

 
Figure 20: Parallel Block-Diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Diagonal Blocks and Border

 
Figure 21: Parallel Block-Diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Update the Last Diagonal Block --- Active Message Communications Paradigm

 
Figure 22: Parallel Block-Diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Update the Last Diagonal Block --- Asynchronous Buffered Communications Paradigm

 
Figure 23: Parallel Block-Diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Last Diagonal Block
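
To make the three-phase structure concrete, the following serial C sketch performs block-diagonal-bordered forward reduction on a small dense matrix. This is not the thesis implementation: the actual solver is sparse and distributes the diagonal blocks across processors, with phase 2 carried out through interprocessor communications. The sizes and names here (N, NB, BSZ, reduce_block) are illustrative assumptions.

#include <stdio.h>

#define N    6                /* order of the full matrix              */
#define NB   2                /* number of independent diagonal blocks */
#define BSZ  2                /* order of each diagonal block          */
#define LAST (N - NB * BSZ)   /* order of the last diagonal block      */

/* Phase 1: forward reduce one diagonal block and apply its border rows.
   In the parallel solver each processor handles its own blocks; the
   border contributions (acc) are later combined by communications. */
static void reduce_block(double L[N][N], double y[N], double acc[LAST], int s)
{
    for (int j = s; j < s + BSZ; j++) {
        for (int i = j + 1; i < s + BSZ; i++)   /* within the block */
            y[i] -= L[i][j] * y[j];
        for (int r = 0; r < LAST; r++)          /* border rows      */
            acc[r] += L[N - LAST + r][j] * y[j];
    }
}

int main(void)
{
    /* Unit lower triangular factor with block-diagonal-bordered
       sparsity: two 2x2 diagonal blocks, a 2-row border, and a 2x2
       last diagonal block. Values are made up for illustration. */
    double L[N][N] = {
        {1,   0,   0,    0,   0,   0},
        {0.5, 1,   0,    0,   0,   0},
        {0,   0,   1,    0,   0,   0},
        {0,   0,   0.25, 1,   0,   0},
        {0.1, 0.2, 0,    0.3, 1,   0},
        {0,   0.4, 0.5,  0,   0.6, 1},
    };
    double b[N] = {1, 2, 3, 4, 5, 6};   /* right-hand side, reduced in place */
    double acc[LAST] = {0.0};

    /* Phase 1: independent diagonal blocks and border (figure 20). */
    for (int blk = 0; blk < NB; blk++)
        reduce_block(L, b, acc, blk * BSZ);

    /* Phase 2: update the last-block right-hand side (figures 21 and 22).
       Serially this is a subtraction; in parallel it is where the active
       message or asynchronous buffered communications occur. */
    for (int r = 0; r < LAST; r++)
        b[N - LAST + r] -= acc[r];

    /* Phase 3: forward reduce the last diagonal block (figure 23). */
    for (int j = N - LAST; j < N; j++)
        for (int i = j + 1; i < N; i++)
            b[i] -= L[i][j] * b[j];

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, b[i]);
    return 0;
}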

The backward substitution algorithm that operates with the parallel block-diagonal-bordered LU factorization algorithm can be broken into two component parts: back substitute the last diagonal block, then back substitute the remaining upper triangular matrix. The only interprocessor communications required occur when solving for x in the last diagonal block. The solution in this portion of the matrix broadcasts the computed values of x to all processors, so those values are available for the next step, solving for the remaining elements of x in the diagonal blocks. Pseudo-code representations of each parallel algorithm section are presented separately in figures 24 and 25, respectively for backward substitution of the last diagonal block and backward substitution of the diagonal blocks and border.

 
Figure 24: Parallel Block-Diagonal-Bordered Sparse Backward Substitution Algorithm --- LU Factorization --- Last Diagonal Block

 
Figure 25: Parallel Block-Diagonal-Bordered Sparse Backward Substitution Algorithm --- LU Factorization --- Diagonal Blocks and Border
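
As an illustration of the two-part backward substitution, this companion C sketch solves Ux = y on the same small dense block-diagonal-bordered structure assumed above. Part 1 corresponds to the last diagonal block, whose solution values are broadcast to all processors in the parallel solver; part 2 then proceeds independently in every diagonal block with no further communications.

#include <stdio.h>

#define N    6                /* order of the full matrix              */
#define NB   2                /* number of independent diagonal blocks */
#define BSZ  2                /* order of each diagonal block          */
#define LAST (N - NB * BSZ)   /* order of the last diagonal block      */

int main(void)
{
    /* Upper triangular factor with block-diagonal-bordered sparsity:
       two 2x2 diagonal blocks, a 2-column border, and a 2x2 last
       diagonal block. Values are made up for illustration. */
    double U[N][N] = {
        {2, 0.5, 0, 0,    0.1, 0},
        {0, 2,   0, 0,    0,   0.4},
        {0, 0,   2, 0.25, 0.2, 0},
        {0, 0,   0, 2,    0,   0.5},
        {0, 0,   0, 0,    2,   0.6},
        {0, 0,   0, 0,    0,   2},
    };
    double y[N] = {1, 2, 3, 4, 5, 6};   /* output of forward reduction */

    /* Part 1: back substitute the last diagonal block (figure 24).
       In the parallel solver these values are broadcast to all
       processors as they are computed. */
    for (int i = N - 1; i >= N - LAST; i--) {
        for (int j = i + 1; j < N; j++)
            y[i] -= U[i][j] * y[j];
        y[i] /= U[i][i];
    }

    /* Part 2: back substitute the diagonal blocks and border
       (figure 25). Each block needs only its own entries and the
       broadcast last-block values, so all blocks can proceed in
       parallel with no further communications. */
    for (int blk = 0; blk < NB; blk++) {
        int s = blk * BSZ;
        for (int i = s + BSZ - 1; i >= s; i--) {
            for (int j = i + 1; j < s + BSZ; j++)   /* within the block */
                y[i] -= U[i][j] * y[j];
            for (int j = N - LAST; j < N; j++)      /* border columns   */
                y[i] -= U[i][j] * y[j];
            y[i] /= U[i][i];
        }
    }

    for (int i = 0; i < N; i++)
        printf("x[%d] = %g\n", i, y[i]);
    return 0;
}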

Forward reduction and backward substitution algorithms for Choleski factorization are similar to those for LU factorization, with one major difference: the factorization process generates only a single lower triangular matrix, L. For the last diagonal block, one of the two triangular solution steps must therefore occur in a manner that requires more communications than an optimally implemented triangular solution for LU factorization. The kji ordering was selected for LU factorization of the last diagonal block to maximize performance of both forward reduction and backward substitution; as a result, communications were minimized. For Choleski factorization, the optimal direct solver algorithm must use a column distribution for the data in the last block, incur additional communications and calculations in the forward reduction of the last diagonal block, and then back substitute the last diagonal block using an implicit transpose of L. This final step ensures that all values are broadcast to all processors, eliminating an extra, costly communications step. This combination of data distributions and algorithm specifics ensures that the fewest calculations and the minimum amount of communications are performed, and should offer the best opportunity for good parallel performance.
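
The implicit transpose can be illustrated with a short C fragment: the triangular system with the transposed factor is solved by reading column i of L below the diagonal rather than forming the transpose explicitly, so the column distribution chosen for the last block serves both triangular solves. This dense, serial sketch is only a hedged illustration; the matrix values and sizes are assumptions.

#include <stdio.h>

#define M 3   /* order of the last diagonal block */

int main(void)
{
    /* Lower triangular Choleski factor of a small SPD matrix. */
    double L[M][M] = {
        {2,   0, 0},
        {1,   2, 0},
        {0.5, 1, 2},
    };
    double y[M] = {1, 2, 3};   /* output of the forward reduction */

    /* Back substitution with the implicit transpose of L: entry (i,j)
       of the transpose is L[j][i], so the inner loop walks down
       column i of L, the same column-wise access pattern the forward
       reduction uses under a column data distribution. */
    for (int i = M - 1; i >= 0; i--) {
        for (int j = i + 1; j < M; j++)
            y[i] -= L[j][i] * y[j];
        y[i] /= L[i][i];
    }

    for (int i = 0; i < M; i++)
        printf("x[%d] = %g\n", i, y[i]);
    return 0;
}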





