The remaining steps in the parallel algorithm are forward reduction
and backward substitution. The parallel versions of these algorithms
take advantage of the fact that calculations can be performed in one
of two distinct orders that preserve the precedence relation in the
calculations [34]. The combination of these techniques is
used to minimize communication time when solving the last
diagonal block. The forward reduction algorithm that operates with the
parallel block-diagonal-bordered LU factorization algorithm can be
broken into three component parts, similar to LU factorization.
Pseudo-code representations of each parallel algorithm section are
presented separately in figures in the appendix.
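As a concrete illustration, the three component parts of forward reduction can be sketched for a block-diagonal-bordered lower triangular factor. The data layout and function name below are hypothetical, and the NumPy solves stand in for the distributed triangular solves of the parallel algorithm:

```python
import numpy as np

def forward_reduce(L_blocks, L_border, L_last, b_blocks, b_last):
    """Forward reduction L y = b for a block-diagonal-bordered lower
    triangular factor.  Hypothetical layout:
      L_blocks[i] -- lower triangular factor of diagonal block i
      L_border[i] -- lower-border strip coupling block i to the last block
      L_last      -- lower triangular factor of the last diagonal block
    """
    # Part 1: independent triangular solves on the diagonal blocks
    # (no communication -- each processor handles its own blocks).
    y_blocks = [np.linalg.solve(Li, bi) for Li, bi in zip(L_blocks, b_blocks)]

    # Part 2: accumulate the border updates into the last block's
    # right-hand side (combined across processors in the parallel version).
    b_hat = b_last - sum(Bi @ yi for Bi, yi in zip(L_border, y_blocks))

    # Part 3: triangular solve on the last diagonal block -- the only
    # part that requires interprocessor communication.
    y_last = np.linalg.solve(L_last, b_hat)
    return y_blocks, y_last
```

In the sequential sketch the three parts simply run in order; in the parallel algorithm, part 1 proceeds independently on each processor and only part 3 incurs communication.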
The backward substitution algorithm that operates with the parallel
block-diagonal-bordered LU factorization algorithm can be broken into
two component parts: back substitution of the last diagonal block,
then back substitution of the remaining upper triangular matrix. The
only interprocessor communications required occur when solving the
last diagonal block. The solution of this portion of the matrix
broadcasts the computed values to all processors, so those values are
available for the next step, solving the remaining diagonal blocks.
Pseudo-code representations of each parallel algorithm section are
presented separately in figures in the appendix, for backward
substitution of the last diagonal block and backward substitution of
the diagonal blocks and border, respectively.
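The two component parts can be sketched in the same hypothetical layout as before, with the upper triangular factor partitioned into diagonal blocks, an upper border, and the last diagonal block; the NumPy solves again stand in for the distributed operations:

```python
import numpy as np

def backward_substitute(U_blocks, U_border, U_last, y_blocks, y_last):
    """Backward substitution U x = y for a block-diagonal-bordered upper
    triangular factor.  Hypothetical layout:
      U_last      -- upper triangular factor of the last diagonal block
      U_border[i] -- upper-border strip coupling block i to the last block
      U_blocks[i] -- upper triangular factor of diagonal block i
    """
    # Part 1: solve the last diagonal block first; in the parallel
    # algorithm these values are then broadcast to all processors.
    x_last = np.linalg.solve(U_last, y_last)

    # Part 2: with x_last available everywhere, the remaining diagonal
    # blocks are solved independently, with no further communication.
    x_blocks = [np.linalg.solve(Ui, yi - Bi @ x_last)
                for Ui, Bi, yi in zip(U_blocks, U_border, y_blocks)]
    return x_blocks, x_last
```

Note the symmetry with forward reduction: there the last block is solved last and receives updates from the border; here it is solved first and its values feed the border updates.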
Forward reduction and backward substitution algorithms for Choleski
factorization are similar to those for LU factorization, with one
major difference: the factorization process generates only a single
lower triangular matrix, L. For the last diagonal block, one
triangular solution step must occur in a manner that requires more
communications than an optimally implemented triangular solution for
LU factorization. The kji order for LU factorization in the last
diagonal block has been selected to maximize performance for both
forward reduction and backward substitution by minimizing
communications overhead. For Choleski factorization, in contrast, the
optimal direct solver algorithm must use a column distribution for the
data in the last block, incur additional communications in the forward
reduction of the last diagonal block, and then backward substitute the
last diagonal block using an implicit transpose of L. The final step
ensures that all solution values are broadcast to all processors,
eliminating an extra, costly communications step. This combination of
data distributions and algorithm specifics ensures that the fewest
calculations and the least communication are performed, and should
offer the best opportunity for good parallel performance.
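The role of the implicit transpose in the last-block solve can be sketched as follows. The function name is hypothetical, and the `.T` view stands in for the implicit transpose used by the parallel algorithm, which never materializes a separate upper triangular factor:

```python
import numpy as np

def choleski_last_block_solve(L_last, b_last):
    """Triangular solution for the last diagonal block when only the
    single Choleski factor L is stored (A = L L^T on that block)."""
    # Forward reduction with L -- column-oriented in the parallel code,
    # which is where the additional communication arises.
    z = np.linalg.solve(L_last, b_last)
    # Backward substitution with the implicit transpose L^T; no upper
    # triangular factor is ever formed explicitly.
    x = np.linalg.solve(L_last.T, z)
    return x
```

Because both solves reuse the same stored factor, the column data distribution chosen for forward reduction is effectively a row distribution for the transposed backward step, which is what leaves all solution values on all processors.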