The remaining steps in the parallel algorithm are forward reduction and backward substitution. The parallel versions of these algorithms take advantage of the fact that the calculations can be performed in either of two distinct orders that preserve the precedence relation among the calculations [23]. These techniques are combined to minimize communication time when solving for the last diagonal block. The forward reduction algorithm that operates with the parallel block-diagonal-bordered LU factorization algorithm can be broken into three component parts, similar to LU factorization. Pseudo-code representations of each parallel algorithm section are presented separately in figures 20 through 23:
Figure 20: Parallel Block-diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Diagonal Blocks and Border
Figure 21: Parallel Block-diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Update the Last Diagonal Block --- Active Message Communications Paradigm
Figure 22: Parallel Block-diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Update the Last Diagonal Block --- Asynchronous Buffered Communications Paradigm
Figure 23: Parallel Block-diagonal-Bordered Sparse Forward Reduction Algorithm --- LU Factorization --- Last Diagonal Block
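The overall structure of the forward reduction step can be sketched in a dense, sequential form. This is a minimal illustration, not the paper's sparse implementation: the function name, the per-block data layout, and the sequential loop (which the parallel algorithm distributes across processors) are all assumptions for exposition.

```python
import numpy as np

def forward_reduce_bdb(L_blocks, L_border, L_last, b_blocks, b_last):
    """Forward reduction L y = b for a block-diagonal-bordered lower
    triangular matrix.  L_blocks[k] is the k-th independent diagonal
    block, L_border[k] holds the border rows coupling block k to the
    last diagonal block, and L_last is the last diagonal block.  In
    the parallel algorithm each k is assigned to a processor; here the
    loop is sequential for illustration."""
    y_blocks = []
    rhs_last = b_last.astype(float).copy()
    for Lk, Bk, bk in zip(L_blocks, L_border, b_blocks):
        # Independent work: solve this block's lower triangular system.
        yk = np.linalg.solve(np.tril(Lk), bk)
        y_blocks.append(yk)
        # Accumulate this block's border contribution into the last
        # block's right-hand side -- the only step that requires
        # interprocessor communication in the parallel version.
        rhs_last -= Bk @ yk
    # Solve the last diagonal block once all updates have arrived.
    y_last = np.linalg.solve(np.tril(L_last), rhs_last)
    return y_blocks, y_last
```

The update of `rhs_last` corresponds to the communication step of figures 21 and 22, where the paper considers both active message and asynchronous buffered paradigms.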
The backward substitution algorithm that operates with the parallel block-diagonal-bordered LU factorization algorithm can be broken into two component parts: back substitute the last diagonal block, then back substitute the remaining upper triangular matrix. The only interprocessor communications required occur when solving for the unknowns in the last diagonal block. The solution in this portion of the matrix broadcasts those values to all processors, so they are available for the next step, solving for the remaining unknowns in the independent diagonal blocks. Pseudo-code representations of each parallel algorithm section are presented separately in figures 24 and 25, respectively for backward substitution of the last diagonal block and backward substitution of the diagonal blocks and border.
Figure 24: Parallel Block-diagonal-Bordered Sparse Backward Substitution Algorithm --- LU Factorization --- Last Diagonal Block
Figure 25: Parallel Block-diagonal-Bordered Sparse Backward Substitution Algorithm --- LU Factorization --- Diagonal Blocks and Border
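The two component parts of backward substitution can likewise be sketched in a dense, sequential form. As before, the function name and data layout are illustrative assumptions; in the parallel algorithm the solution of the last block is broadcast to all processors, after which every diagonal block proceeds independently.

```python
import numpy as np

def backward_substitute_bdb(U_blocks, U_border, U_last, y_blocks, y_last):
    """Backward substitution U x = y for a block-diagonal-bordered
    upper triangular matrix.  U_blocks[k] is the k-th independent
    diagonal block, U_border[k] holds the border columns coupling
    block k to the last diagonal block, and U_last is the last
    diagonal block."""
    # Part 1: solve the last diagonal block.  In the parallel version
    # these values are broadcast to all processors.
    x_last = np.linalg.solve(np.triu(U_last), y_last)
    # Part 2: with x_last known, every diagonal block is solved
    # independently -- no further communication is required.
    x_blocks = []
    for Uk, Ck, yk in zip(U_blocks, U_border, y_blocks):
        xk = np.linalg.solve(np.triu(Uk), yk - Ck @ x_last)
        x_blocks.append(xk)
    return x_blocks, x_last
```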
Forward reduction and backward substitution algorithms for Choleski
factorization are similar to those for LU factorization, with one
major difference: the factorization process generates only a single
lower triangular matrix, L. For the last diagonal block, one
triangular solution step must therefore occur in a manner that
requires more communications than an optimally implemented triangular
solution for LU factorization. The kji factorization variant was
selected for the last diagonal block in LU factorization to maximize
performance for both forward reduction and backward substitution; as
a result, communications were minimized. For Choleski factorization,
in contrast, the optimal direct solver algorithm must use a column
distribution for the data in the last block, incur additional
communications and calculations in the forward reduction of the last
diagonal block, and then backward substitute the last diagonal block
using an implicit transpose of L. This final step ensures that all
values are broadcast to all processors, eliminating an extra, costly
communications step. This combination of data distributions and
algorithm specifics ensures that the fewest calculations and the
least communications are performed, and should offer the best
opportunity for good parallel performance.
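The implicit-transpose step for Choleski factorization can be illustrated with a minimal dense sketch: since only L is stored, backward substitution solves against L^T without ever forming a separate upper triangular factor. The function name is an assumption, and the sketch ignores the column distribution and communication issues discussed above.

```python
import numpy as np

def choleski_triangular_solves(L, b):
    """Solve A x = b where A = L L^T and only the lower triangular
    Choleski factor L is stored.  Forward reduction uses L directly;
    backward substitution uses the implicit transpose L^T."""
    y = np.linalg.solve(L, b)    # forward reduction:      L y = b
    x = np.linalg.solve(L.T, y)  # backward substitution:  L^T x = y
    return x
```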