Block-diagonal-bordered matrices have the potential for efficient
reduction and substitution steps because of limited, regular
communications, as seen in figure 1(b). The values in the
y vector corresponding to the independent diagonal
blocks can be computed in a highly parallel manner. For the forward
reduction step, the only communications required are to send the
partial sums of products from rows in the borders to those processors
that hold the data for the last block; these partial sums can be
reduced in parallel in the same manner as the reduction of partial
sums in the factorization step, depicted in
figure 11. In essence, the partial sums of
products $l_{i,j} y_j$, such that row $i$ lies in the border and
column $j$ lies in an independent block,
are used to modify the value of
$y_i$ in the last block. After all updates have been
made using the values of
$y_j$ in the independent blocks, the
forward reduction of the last block can proceed using the best
available techniques for the nature of this sub-matrix.
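As a concrete sketch, the forward reduction over the independent blocks, border rows, and last block can be written as follows. The function names and the dense-block representation are illustrative only; an actual implementation would operate on the implicit sparse storage and replace the Python-level sum with a message-passing reduction:

```python
import numpy as np

def forward_reduce_block(L, b):
    """Solve L y = b for one lower-triangular block by forward reduction."""
    y = np.zeros_like(b)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def bordered_forward_reduction(blocks, borders, L_last, b_blocks, b_last):
    """Forward reduction of a block-diagonal-bordered lower-triangular system.

    blocks[k]  -- independent diagonal block (lower triangular)
    borders[k] -- border rows coupling block k's columns to the last block
    L_last     -- last diagonal block (lower triangular)
    """
    # The independent-block solves carry no inter-block dependencies, so
    # each would run on the processor that owns that block's rows.
    ys = [forward_reduce_block(Lk, bk) for Lk, bk in zip(blocks, b_blocks)]

    # Each processor forms the partial sums of products from its border
    # rows; the only communication required is reducing these partial
    # sums onto the processors that hold the last block.
    rhs_last = b_last - sum(Bk @ yk for Bk, yk in zip(borders, ys))

    # Forward reduction of the last block proceeds once all updates from
    # the independent blocks have been applied.
    y_last = forward_reduce_block(L_last, rhs_last)
    return ys, y_last
```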
For the backward substitution step, which calculates the values in the
vector x, the x values from the last block must be broadcast to all
processors. If the data in the last block is distributed across
multiple processors, each value must be broadcast to all other
processors immediately after it is calculated. In this instance,
special techniques that support lightweight communications, such as
active messages
[22] on the Thinking Machines CM-5, can be used to minimize
communications latency. Regardless of the communications capabilities
of the distributed-memory multi-processor, as soon as values in the
x vector from the last block have been broadcast to all processors,
calculations to update the y vector can be made with
values in the borders because of the sub-matrix precedence relationship
in backward substitution. After all values in the x vector from the
last block have been solved, the remaining calculations to solve for
values of x in the independent blocks can be performed in parallel
without additional interprocessor communications: all data for a given
row are allocated to a single processor, and the blocks are
independent, so no x vector values need be sent to other processors.
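The backward substitution step admits a similar sketch. As before, the names and dense-block representation are illustrative; here `borders[k]` holds block k's border columns, which multiply the x values broadcast from the last block:

```python
import numpy as np

def back_substitute_block(U, y):
    """Solve U x = y for one upper-triangular block by backward substitution."""
    n = len(y)
    x = np.zeros_like(y)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def bordered_backward_substitution(blocks, borders, U_last, ys, y_last):
    """Backward substitution of a block-diagonal-bordered upper-triangular
    system.

    blocks[k]  -- independent diagonal block (upper triangular)
    borders[k] -- border columns coupling the last block to block k's rows
    U_last     -- last diagonal block (upper triangular)
    """
    # The last block is solved first; in a distributed setting each x
    # value would be broadcast to all processors as soon as it is
    # calculated.
    x_last = back_substitute_block(U_last, y_last)

    # Once x_last is known, every processor updates its own y values
    # with its border columns and solves its independent block -- no
    # further interprocessor communication is needed.
    xs = [back_substitute_block(Uk, yk - Ck @ x_last)
          for Uk, Ck, yk in zip(blocks, borders, ys)]
    return xs, x_last
```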
There are additional implementation features that can improve the efficiency of forward reduction and backward substitution. By including the row and column locations of each value in the data structures for the implicit storage of the matrix, indexing overhead can be minimized: modifications in both the forward reduction and backward substitution steps can then be performed without recalculating the row or column location of a value.
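A minimal sketch of such a data structure, assuming a coordinate-style record (the `Entry` class and field names here are hypothetical, not taken from the paper): each stored value carries its own row and column location, so forming a border row's partial sum of products requires no index arithmetic at all:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    row: int    # row location, stored explicitly with the value
    col: int    # column location, stored explicitly with the value
    val: float

def border_partial_sums(border_entries, y, partial):
    """Accumulate a border row's partial sums of products.

    Each entry states directly which y value it multiplies (col) and
    which partial sum it updates (row), so neither location needs to
    be recomputed during the sweep.
    """
    for e in border_entries:
        partial[e.row] += e.val * y[e.col]
```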