Implementations for both parallel block-diagonal-bordered sparse LU
and Choleski factorization have been developed in the C programming
language for the Thinking Machines CM-5 multi-processor using a
host-node paradigm with message passing. Two versions of each
parallel block-diagonal-bordered algorithm have been implemented: one
implementation uses low-latency active messages to update the last
diagonal block using data in the borders, and the second implementation
uses conventional, higher-latency, asynchronous, buffered, non-blocking
interprocessor communications. The communications paradigms for these
two implementations differed significantly. With active messages, the
communications paradigm is to calculate a vector-vector product and
immediately send the value to the processor holding the corresponding
element of the last diagonal block. With asynchronous, buffered
communications, the paradigm is to perform the vector-vector products,
store them in buffers, and then have each processor send its buffers to
all other processors.
all other processors. The active message communications paradigm
greatly simplified development of the algorithm, and the empirical
results, presented in the next section, show that the active
message-based implementation is significantly faster.
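As a minimal sketch of the difference between the two paradigms, the following C fragment contrasts the send side of each. The routines am_send_update(), buffered_send_async(), and owner_of() are hypothetical stand-ins for the CM-5 active message and asynchronous buffered primitives and the data-to-processor mapping actually used, and MAX_UPDATES is an illustrative buffer size.

#define MAX_UPDATES 4096                    /* illustrative buffer size only      */

/* Hypothetical stand-ins for the CM-5 primitives actually used: a low-latency   */
/* active message carrying a single update, and a higher-latency asynchronous,   */
/* buffered, non-blocking send.                                                  */
extern void am_send_update(int dest, int row, int col, double value);
extern void buffered_send_async(int dest, const double *vals,
                                const int *rows, const int *cols, int count);
extern int  owner_of(int row, int col);     /* processor holding entry (row,col)  */
                                            /* of the last diagonal block         */

typedef struct {                            /* one outgoing buffer per processor  */
    double vals[MAX_UPDATES];
    int    rows[MAX_UPDATES], cols[MAX_UPDATES];
    int    count;
} update_buf_t;

/* Active message paradigm: ship each partial sum as soon as it is computed.     */
void send_update_am(int row, int col, double partial)
{
    am_send_update(owner_of(row, col), row, col, partial);
}

/* Buffered paradigm: accumulate partial sums per destination processor, then    */
/* have each processor send its buffers to all other processors in one pass.     */
void queue_update(update_buf_t *bufs, int row, int col, double partial)
{
    update_buf_t *b = &bufs[owner_of(row, col)];
    b->vals[b->count] = partial;
    b->rows[b->count] = row;
    b->cols[b->count] = col;
    b->count++;
}

void flush_updates(update_buf_t *bufs, int nprocs, int my_proc)
{
    int p;
    for (p = 0; p < nprocs; p++)
        if (p != my_proc && bufs[p].count > 0)
            buffered_send_async(p, bufs[p].vals, bufs[p].rows,
                                bufs[p].cols, bufs[p].count);
}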
The block-diagonal-bordered LU factorization algorithm can be broken into three component parts, as defined in the derivation of available parallelism in section 4. Pseudo-code representations of each parallel algorithm section are presented separately in figures 16 through 19, as follows:
Figure 16: Parallel Block-Diagonal-Bordered Sparse LU Factorization Algorithm --- Diagonal Blocks and Border
Figure 17: Parallel Block-Diagonal-Bordered Sparse LU Factorization Algorithm --- Update the Last Diagonal Block --- Active Message Communications Paradigm
Figure 18: Parallel Block-Diagonal-Bordered Sparse LU Factorization Algorithm --- Update the Last Diagonal Block --- Asynchronous Buffered Communications Paradigm
Figure 19: Parallel Block-Diagonal-Bordered Sparse LU Factorization Algorithm --- Last Diagonal Block
The LU factorization algorithm for the diagonal blocks and border follows Doolittle's form and is a fan-in type algorithm. Conversely, the LU factorization algorithm for the last diagonal block follows a fan-out algorithm requiring rank-1 submatrix updates as each row or block of rows is factored.
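The distinction can be illustrated with a dense, in-place sketch in C; the actual algorithms in figures 16 through 19 operate on sparse data structures, and no pivoting is shown. In the Doolittle fan-in form, each entry of the factored row and column is finished with a single accumulated dot product over previously factored rows and columns, whereas the fan-out form applies the pivot row and column immediately to the remaining submatrix.

/* Doolittle (fan-in) step i on a dense matrix a, factored in place: the entries */
/* of row i of U and column i of L are each completed with one dot product over  */
/* the already-factored rows 0..i-1.  No pivoting is shown.                      */
void doolittle_step(double **a, int n, int i)
{
    int j, k;

    for (j = i; j < n; j++)                 /* row i of U                         */
        for (k = 0; k < i; k++)
            a[i][j] -= a[i][k] * a[k][j];

    for (j = i + 1; j < n; j++) {           /* column i of L                      */
        for (k = 0; k < i; k++)
            a[j][i] -= a[j][k] * a[k][i];
        a[j][i] /= a[i][i];
    }
}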
Because the factorization of the last block is distributed to all active node
processors, interprocessor communication is required in the second
step of this algorithm. The algorithm section that updates the last
diagonal block calculates a sparse matrix-matrix product as individual
sparse vector-vector products of lower-border rows and upper-border
columns, and these partial sums must be distributed to the proper
processor holding the corresponding data in the last diagonal block.
Separate sparse vector products are performed for each block of data on
a processor. Only nonzero rows in the lower border and nonzero columns
in the upper border are used when calculating the vector-vector products
that generate the partial sum values required to update the last
diagonal block. Examining only nonzero values significantly limits the
amount of calculation in this phase. In the process of developing an
implementation with optimal performance, it was discovered that any
attempt to consolidate updates to a value in the last diagonal block
caused more overhead than simply sending multiple update values to the
same processor: there is a higher order of work required to sum update
data than to calculate the sparse vector products. For the same reason,
no attempt is made at parallel reduction of the partial sums of updates
from the borders.
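A minimal sketch of one such sparse vector product is given below, assuming each nonzero lower-border row and upper-border column is stored as parallel index/value arrays sorted by index; the names are illustrative, not the implementation's own. Only positions where both vectors hold nonzeros contribute to the partial sum, which is then sent, unconsolidated, to the processor owning the target entry of the last diagonal block.

/* Sparse vector-vector product of a lower-border row and an upper-border column, */
/* each stored as parallel index/value arrays sorted by index.  Only positions    */
/* present in both vectors contribute to the partial sum.                         */
double border_partial_sum(const int *row_idx, const double *row_val, int row_nnz,
                          const int *col_idx, const double *col_val, int col_nnz)
{
    double sum = 0.0;
    int a = 0, b = 0;

    while (a < row_nnz && b < col_nnz) {
        if (row_idx[a] < col_idx[b])
            a++;
        else if (row_idx[a] > col_idx[b])
            b++;
        else
            sum += row_val[a++] * col_val[b++];
    }
    return sum;
}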
This block-diagonal-bordered LU factorization implementation solves the last block using a sparse blocked kji-saxpy LU algorithm based on the dense kij-saxpy algorithm described in [43]. Our examinations of power systems network matrices showed that after partitioning these matrices into block-diagonal-bordered form, there is little additional available parallelism in the last diagonal block --- insufficient parallelism to be exploited in a load-balanced manner. The algorithm to factor the last diagonal block maintains the blocked format and pipelined nature of the dense kij-saxpy algorithm; however, changes were made to the algorithm in order to minimize calculations and to minimize communications overhead. The three significant changes to the kij-saxpy algorithm are:
While changing the order of calculations in each rank-1 update of the
partially factored submatrix has little effect on the factorization
algorithm itself (it is equivalent to exchanging the order of the i and
j for loops), this seemingly small modification contributed
significantly to improving the performance of the forward reduction and
backward substitution steps. It greatly reduced the amount of
communications during both forward reduction and backward substitution
by allowing the broadcast of only the calculated values of the
reduction and substitution vectors, without also requiring the
broadcast of the partial sums encountered when updating those values.
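A dense, in-place sketch of the loop exchange is shown below; the implementation applies the same exchange to the sparse row vectors of the blocked last-diagonal-block algorithm. Both orderings produce identical factors, and no pivoting is shown.

/* kij ordering, as in the dense algorithm of [43]: each row of the remaining    */
/* submatrix is scaled and updated in turn.                                       */
void lu_kij(double **a, int n)
{
    int i, j, k;
    for (k = 0; k < n; k++)
        for (i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];
            for (j = k + 1; j < n; j++)
                a[i][j] -= a[i][k] * a[k][j];
        }
}

/* kji ordering: the same arithmetic with the inner two loops exchanged, so the  */
/* rank-1 update sweeps the remaining submatrix a column at a time.              */
void lu_kji(double **a, int n)
{
    int i, j, k;
    for (k = 0; k < n; k++) {
        for (i = k + 1; i < n; i++)
            a[i][k] /= a[k][k];             /* scale column k of L first          */
        for (j = k + 1; j < n; j++)
            for (i = k + 1; i < n; i++)
                a[i][j] -= a[i][k] * a[k][j];
    }
}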
As would be expected with a sparse matrix, data is stored as sparse vectors with explicit row and column identifiers. To optimize performance for a kji algorithm, these sparse vectors correspond to rows in the matrix, and the row data is held in three separate data structures.
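As a hedged illustration of this storage scheme only (the field names and the exact split into structures are assumptions, not the implementation's own layout), a sparse row with explicit identifiers might be organized in C as follows.

/* Illustrative sketch: one sparse row stored with explicit identifiers, and a   */
/* table of such rows for one block.  Layout and names are assumptions.          */
typedef struct {
    int     row;        /* explicit row identifier                                */
    int     nnz;        /* number of nonzeros in this row                         */
    int    *cols;       /* explicit column identifiers, sorted                    */
    double *vals;       /* corresponding nonzero values                           */
} sparse_row_t;

typedef struct {
    int           nrows;
    sparse_row_t *rows; /* sparse row vectors of one diagonal or border block     */
} sparse_block_t;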
Parallel block-diagonal-bordered Choleski factorization algorithms are similar to the LU factorization algorithms presented in figures 16 through 19 --- with modifications to account for the symmetric nature of the matrices used in Choleski factorization. Choleski factorization has only about half of the calculations of LU factorization. Block-diagonal-bordered Choleski factorization has half the number of updates to the last diagonal block. The parallel Choleski algorithm using the active message communications paradigm would see reduced communications when compared to LU factorization; however, there would be no reduction in the number of messages required in a buffered communications update of the last diagonal block. While the buffers would be shorter, there is still the requirement for each processor to communicate with all other processors. In most instances, the message start-up latency dominates, not the per-word transport costs. There would also be no reduction in the amount of communications when factoring the last block.
In other words, parallel Choleski factorization is a more difficult algorithm to implement than LU factorization, because there is a significant reduction in the amount of calculations without a similar reduction in communications overhead. Consequently, the results in section 8 will compare the performance of LU factorization and Choleski factorization to illustrate performance as a function of the amount of floating point operations versus the communications overhead. To get a better understanding of this trend, a version of the parallel block-diagonal-bordered LU factorization algorithm has also been implemented for complex data. Each complex multiplication requires four floating point multiplies and two adds/subtracts, compared with a single multiply in double precision real arithmetic. With these three implementations, we will be able to clearly illustrate the need for high-speed, low-latency communications in algorithms that solve power systems network linear equations. These matrices are sufficiently sparse that increasing the number of floating point operations to two or six times that of Choleski factorization produces a marked increase in the relative parallel speedup of these algorithms: communications overhead remains constant while only the number of floating point operations increases. By considering the communications-to-computations ratio, or granularity, relative to the capabilities of the target parallel architecture, it is possible to identify what communications capabilities are required in that architecture.
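As a point of reference for these operation counts, one complex multiplication expands as (a + jb)(c + jd) = (ac - bd) + j(ad + bc): four real multiplies and two real additions/subtractions, where a single real multiply suffices in double precision real arithmetic.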
The same implicit hierarchical data structure used in the parallel implementation can be readily used in a sequential implementation, where the entire matrix is placed on a single node processor of the CM-5. Single-processor performance is measured in this manner so that relative speedup can be calculated to examine parallel algorithm performance.