Implementations for both parallel block-diagonal-bordered sparse LU
and Choleski factorization have been developed in the C programming
language for the Thinking Machines CM-5 multi-processor using a
host-node paradigm with explicit message passing. Two versions of
each parallel block-diagonal-bordered algorithm have been implemented:
one implementation uses low-latency active messages to update the last
diagonal block using data in the borders, and the second implementation
uses conventional, higher-latency non-blocking, buffered
interprocessor communications. The communications paradigms for these
two implementations differ significantly. The communications paradigm
we used with active messages is to calculate a sparse vector × vector
partial product and immediately send the resulting value to the
processor holding the corresponding element of the last diagonal
block. The communications paradigm for buffered communications is to
perform the vector × vector products, store the results in buffers,
and then have each processor send its buffers to all other processors. The
low-latency, active message communications paradigm greatly simplified
development of the algorithm, and the empirical results presented in
the next chapter show that (not unexpectedly) the low-latency
communications implementation is substantially faster.
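The contrast between the two paradigms can be sketched in a few lines of C. The sketch below is illustrative only: the routines am_send_update(), buffer_update(), and flush_buffers(), and the constants NPROCS and BUF_LEN, are hypothetical printf stubs standing in for the actual CM-5 active-message and buffered communication calls, not the library API.

```c
#include <stdio.h>

#define NPROCS  4
#define BUF_LEN 16

/* Hypothetical stand-in for a low-latency active-message send: the
 * partial sum is dispatched to the owning processor as soon as it is
 * computed. */
static void am_send_update(int dest, int row, int col, double value)
{
    printf("active message -> proc %d: update (%d,%d) += %g\n",
           dest, row, col, value);
}

/* Buffered paradigm: partial sums are accumulated per destination
 * processor and shipped later in one larger, higher-latency message. */
struct update { int row, col; double value; };
static struct update buf[NPROCS][BUF_LEN];
static int buf_count[NPROCS];

static void buffer_update(int dest, int row, int col, double value)
{
    struct update *u = &buf[dest][buf_count[dest]++];
    u->row = row; u->col = col; u->value = value;
}

static void flush_buffers(void)
{
    for (int p = 0; p < NPROCS; p++) {
        printf("buffered send -> proc %d: %d updates in one message\n",
               p, buf_count[p]);
        buf_count[p] = 0;
    }
}

int main(void)
{
    /* Pretend each (row,col,value) triple was produced by a sparse
     * vector × vector product destined for the last diagonal block. */
    am_send_update(2, 7, 9, -0.5);   /* immediate dispatch             */
    buffer_update(2, 7, 9, -0.5);    /* deferred, batched dispatch     */
    buffer_update(3, 8, 9, 1.25);
    flush_buffers();
    return 0;
}
```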
The block-diagonal-bordered LU factorization algorithm can be broken
into three component parts, as defined in the earlier derivation of
available parallelism. Pseudo-code representations of each of these
parallel algorithm sections are presented separately in the appendix.
The LU factorization algorithm for the diagonal blocks and border follows Doolittle's form and is a fan-in algorithm. Conversely, the LU factorization algorithm for the last diagonal block is a fan-out algorithm requiring rank-1 sub-matrix updates as each row or block of rows is factored.
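To make the fan-in character of the Doolittle form concrete, the sketch below factors one row of a small dense matrix by gathering inner products of previously factored rows. It is an illustrative dense kernel under assumed names (doolittle_row, N), not the sparse implementation described here.

```c
#include <stdio.h>

#define N 3

/* Doolittle (fan-in) factorization of row i: every entry of the row is
 * finished by gathering contributions from previously factored rows,
 * rather than by broadcasting rank-1 updates to the trailing sub-matrix. */
static void doolittle_row(double a[N][N], int i)
{
    for (int j = 0; j < N; j++) {
        int    lim = (j < i) ? j : i;      /* inner-product length      */
        double sum = a[i][j];
        for (int k = 0; k < lim; k++)
            sum -= a[i][k] * a[k][j];      /* fan-in of prior rows      */
        a[i][j] = (j < i) ? sum / a[j][j]  /* L entry (below diagonal)  */
                          : sum;           /* U entry (on/above diag.)  */
    }
}

int main(void)
{
    double a[N][N] = {{4, 1, 2}, {1, 5, 3}, {2, 3, 6}};

    for (int i = 0; i < N; i++)
        doolittle_row(a, i);               /* rows factored in order    */

    printf("U(2,2) = %g, L(2,1) = %g\n", a[2][2], a[2][1]);
    return 0;
}
```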
The algorithm section that updates the last diagonal block calculates
a sparse matrix × sparse matrix product by computing individual
sparse vector × sparse vector products for lower border rows and
upper border columns. These partial sums must be distributed to the
proper processor holding data in the last diagonal block. Separate
sparse vector products are performed for each block of data on a
processor. Only non-zero rows in the lower border and non-zero columns
in the upper border are used when calculating the vector × vector
products that generate the required partial sums to update the last
diagonal block. Examining only non-zero values significantly limits
the amount of calculation in this phase.
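Each such partial sum reduces to matching the non-zero indices of two sparse vectors. The sketch below shows one sparse vector × sparse vector product under an assumed index/value layout; the structure and names (spvec, sparse_dot) are illustrative, not the implementation's actual data structures.

```c
#include <stdio.h>

/* One sparse vector: sorted non-zero indices and matching values. */
struct spvec {
    int     nnz;
    int    *idx;   /* ascending row (or column) indices */
    double *val;
};

/* Partial sum for one (lower border row, upper border column) pair:
 * only positions where both vectors hold a non-zero contribute. */
static double sparse_dot(const struct spvec *a, const struct spvec *b)
{
    double sum = 0.0;
    int i = 0, j = 0;
    while (i < a->nnz && j < b->nnz) {
        if (a->idx[i] == b->idx[j])
            sum += a->val[i++] * b->val[j++];
        else if (a->idx[i] < b->idx[j])
            i++;
        else
            j++;
    }
    return sum;
}

int main(void)
{
    int    ai[] = {1, 4, 7};   double av[] = {2.0, -1.0, 0.5};
    int    bi[] = {4, 6, 7};   double bv[] = {3.0,  1.0, 4.0};
    struct spvec a = {3, ai, av}, b = {3, bi, bv};

    /* This value would be sent to the processor owning the
     * corresponding entry of the last diagonal block. */
    printf("partial sum = %g\n", sparse_dot(&a, &b));
    return 0;
}
```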
In the process of developing an implementation with optimal
performance, we discovered that any attempt to consolidate updates
to a value in the last diagonal block caused more overhead than was
encountered by sending multiple update values to the same processor.
There is more work required to sum update data than to calculate the
sparse vector products. Likewise, there has been no attempt at
parallel reduction of the partial sums of updates from the borders.
This block-diagonal-bordered LU factorization implementation solves the last block using a sparse blocked kji-saxpy LU algorithm based on the dense kij-saxpy algorithm described in [61]. Here k is the index corresponding to the diagonal, i is the index corresponding to rows, and j is the index corresponding to columns. Our examinations of power systems network matrices showed that, after partitioning these matrices into block-diagonal-bordered form, there is little additional parallelism available in the last diagonal block: insufficient inherent parallelism to be exploited in a load-balanced manner. Consequently, a pipelined algorithm has been used. The algorithm to factor the last diagonal block maintains the blocked format and pipelined nature of the dense kij-saxpy algorithm; however, special changes were made to the algorithm in order to minimize calculations and to minimize overhead when performing communications. The three significant changes to the kij-saxpy algorithm are described in the paragraphs that follow.
While changing the order of calculations in each rank-1 update of the
partially factored sub-matrix has little effect on the factorization
algorithm, it is equivalent to exchanging the order of the i and j for
loops. This seemingly small modification contributed significantly to
improving the performance of the forward reduction and backward
substitution steps: it greatly reduced the amount of communications
during both forward reduction and backward substitution by allowing
the broadcast of only the calculated solution values, without also
requiring the broadcast of partial sums encountered when updating
values.
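The loop exchange itself can be seen in a dense analogue of the rank-1 update. The sketch below is a simplified dense kernel, not the blocked sparse implementation: only the i and j loops are swapped between lu_kij and lu_kji, and the arithmetic of each update is unchanged.

```c
#include <stdio.h>

#define N 4

/* kij ordering: for each pivot k, update one row i at a time. */
static void lu_kij(double a[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                 /* multiplier L(i,k) */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];   /* rank-1 update     */
        }
}

/* kji ordering: identical arithmetic with the i and j loops swapped. */
static void lu_kji(double a[N][N])
{
    for (int k = 0; k < N; k++) {
        for (int i = k + 1; i < N; i++)
            a[i][k] /= a[k][k];                 /* multipliers first  */
        for (int j = k + 1; j < N; j++)
            for (int i = k + 1; i < N; i++)
                a[i][j] -= a[i][k] * a[k][j];   /* column-wise update */
    }
}

int main(void)
{
    double a[N][N] = {{4,1,0,1},{1,4,1,0},{0,1,4,1},{1,0,1,4}};
    double b[N][N] = {{4,1,0,1},{1,4,1,0},{0,1,4,1},{1,0,1,4}};

    lu_kij(a);
    lu_kji(b);
    printf("L(3,2) from kij = %g, from kji = %g\n", a[3][2], b[3][2]);
    return 0;
}
```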
As would be expected with a sparse matrix, data is stored as sparse vectors with explicit row and column identifiers. To optimize performance for a kji algorithm, data is stored in sparse vectors corresponding to rows in the matrix. These sparse vectors of row data are stored as three separate data structures.
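As one plausible illustration only, the sketch below assumes a layout of non-zero values, explicit column identifiers, and per-row start offsets; the three structures actually used in the implementation may differ, and the names (sparse_rows, row_start, col_id, val) are hypothetical.

```c
#include <stdio.h>

/* Illustrative sparse-row layout: one array of non-zero values, one
 * array of explicit column identifiers, and per-row start offsets. */
struct sparse_rows {
    int     nrows;
    int    *row_start;   /* offset of each row's first non-zero, plus end */
    int    *col_id;      /* explicit column identifier per non-zero       */
    double *val;         /* non-zero values, stored row by row            */
};

static void print_row(const struct sparse_rows *m, int r)
{
    for (int k = m->row_start[r]; k < m->row_start[r + 1]; k++)
        printf("  (%d,%d) = %g\n", r, m->col_id[k], m->val[k]);
}

int main(void)
{
    /* 3x3 example with 4 non-zeros. */
    int    starts[] = {0, 2, 3, 4};
    int    cols[]   = {0, 2, 1, 2};
    double vals[]   = {4.0, -1.0, 3.0, 5.0};
    struct sparse_rows m = {3, starts, cols, vals};

    for (int r = 0; r < m.nrows; r++)
        print_row(&m, r);
    return 0;
}
```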
Parallel block-diagonal-bordered Choleski factorization algorithms are
similar to the LU factorization algorithms presented in
the appendix, with modifications to
account for the symmetric nature of the matrices used in Choleski
factorization. Choleski factorization has only about half of the
calculations of LU factorization, and block-diagonal-bordered Choleski
factorization has only half the number of updates to the last diagonal
block. The parallel Choleski algorithm using the active message
communications paradigm would see reduced communications when compared
to LU factorization; however, there would be no reduction in the
number of messages required in a buffered communications update of the
last diagonal block. While the buffers would be shorter, there is
still the requirement for each processor to communicate with all other
processors. In most instances, the message start-up latency
dominates, not the per-word transport costs. There would also be no
reduction in the amount of communications when factoring the last
block.
In other words, parallel Choleski factorization is a more difficult
algorithm to implement efficiently than LU factorization, because
there is a significant reduction in the amount of calculations without
a similar reduction in communications overhead. The
computation-to-communications ratio for Choleski factorization is
therefore only about one-half that of LU factorization. Consequently, the
results in section 7.1 will compare the performance
of LU factorization and Choleski factorization to illustrate
performance as a function of the amount of floating point operations
versus the communications overhead. To get a better understanding of
this trend, a version of the parallel block-diagonal-bordered LU
factorization algorithm has been implemented for complex data.
Complex arithmetic requires four floating-point multiplies and four
additions/subtractions for each multiply/subtract operation, compared
to a single multiply and subtract for double-precision real data. With
these three implementations, we will be able to clearly illustrate the
need for low-latency communications in algorithms that solve power
systems network linear equations. These matrices are sufficiently
sparse that increasing the number of floating-point operations to two
or eight times that of Choleski factorization yields a marked increase
in the relative parallel speedup of these algorithms: communications
overhead remains constant while only the number of floating-point
operations increases.
By considering the computation-to-communications ratio, or
granularity, of these algorithms, we can identify the communications
capabilities required in a target parallel architecture.
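As a concrete check of the operation counts used above, the sketch below expands a single complex multiply/subtract update into its real arithmetic: four multiplies and four additions/subtractions, versus one multiply and one subtract for real double-precision data. The struct and function names (cplx, cmul_sub) are illustrative, not taken from the implementation.

```c
#include <stdio.h>

/* One update term a -= l * u, written out for complex data: four
 * multiplies and four additions/subtractions, compared with a single
 * multiply and subtract for real double-precision data. */
struct cplx { double re, im; };

static struct cplx cmul_sub(struct cplx a, struct cplx l, struct cplx u)
{
    struct cplx r;
    r.re = a.re - (l.re * u.re - l.im * u.im);  /* 2 mults, 2 add/subs */
    r.im = a.im - (l.re * u.im + l.im * u.re);  /* 2 mults, 2 add/subs */
    return r;
}

int main(void)
{
    struct cplx a = {1.0, 2.0}, l = {0.5, -1.0}, u = {3.0, 0.25};
    struct cplx r = cmul_sub(a, l, u);
    printf("updated value = %g + %gi\n", r.re, r.im);
    return 0;
}
```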