
Parallel Blocked-Diagonal-Bordered LU Factorization

Implementations of both parallel block-diagonal-bordered sparse LU and Choleski factorization have been developed in the C programming language for the Thinking Machines CM-5 multi-processor, using a host-node paradigm with explicit message passing. Two versions of each parallel block-diagonal-bordered algorithm have been implemented: the first uses low-latency active messages to update the last diagonal block with data from the borders, and the second uses conventional, higher-latency, non-blocking, buffered interprocessor communications. The communications paradigms for these two implementations differ significantly. With active messages, each vector-vector product is calculated and its value is immediately sent to the processor holding the corresponding element of the last diagonal block. With buffered communications, the vector-vector products are calculated and stored in buffers, and each processor then sends a buffer to every other processor. The low-latency active message paradigm greatly simplified development of the algorithm, and the empirical results presented in the next chapter show that, not unexpectedly, the low-latency implementation is substantially faster.
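For concreteness, the sketch below contrasts the two communication patterns for a single processor's border updates. MPI is used here only as a stand-in for the CM-5 active message and buffered communications layers, and owner_of() together with the Update triple are hypothetical helpers; the sketch illustrates the patterns, not the actual implementation.

    #include <mpi.h>

    typedef struct { int row, col; double val; } Update;

    extern int owner_of(int row, int col);      /* hypothetical data-to-processor map */

    /* Active-message style: ship each partial sum as soon as it is computed. */
    void send_update_immediately(int row, int col, double partial)
    {
        Update u = { row, col, partial };
        MPI_Send(&u, (int)sizeof u, MPI_BYTE, owner_of(row, col), 1,
                 MPI_COMM_WORLD);               /* one small, low-latency message */
    }

    /* Buffered style: accumulate updates per destination, then exchange buffers. */
    void send_buffered_updates(Update *buf[], const int count[],
                               int nprocs, int my_rank)
    {
        for (int p = 0; p < nprocs; p++)
            if (p != my_rank && count[p] > 0)
                MPI_Send(buf[p], count[p] * (int)sizeof(Update), MPI_BYTE,
                         p, 2, MPI_COMM_WORLD); /* one bulk, higher-latency message */
    }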

The block-diagonal-bordered LU factorization algorithm can be broken into three component parts, as defined in the derivation of available parallelism in an earlier chapter. Pseudo-code representations of each parallel algorithm section are presented separately in the appendix. The three sections, outlined in the sketch that follows this list, are to:

  1. factor the diagonal blocks and the border,
  2. update the last diagonal block,
  3. factor the last diagonal block.
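
A minimal structural sketch of this three-phase organization is given below; the Matrix type and the three phase functions are hypothetical placeholders, not the actual implementation.

    /* Structural sketch of block-diagonal-bordered LU factorization
     * (hypothetical names). */
    typedef struct Matrix Matrix;

    extern void factor_diagonal_blocks_and_border(Matrix *A);
    extern void update_last_diagonal_block(Matrix *A);
    extern void factor_last_diagonal_block(Matrix *A);

    void block_bordered_lu(Matrix *A)
    {
        factor_diagonal_blocks_and_border(A);  /* phase 1: independent blocks, no communications */
        update_last_diagonal_block(A);         /* phase 2: border vector-vector products         */
        factor_last_diagonal_block(A);         /* phase 3: pipelined sparse kji-saxpy LU         */
    }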

The LU factorization algorithm for the diagonal blocks and border follows Doolittle's form and is a fan-in algorithm. Conversely, the LU factorization algorithm for the last diagonal block is a fan-out algorithm requiring rank-1 sub-matrix updates as each row or block of rows is factored.
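
The dense analogue below illustrates the fan-in character of Doolittle's form: each row is completed using only previously factored rows, so partial results are gathered into the row being computed rather than scattered into a trailing sub-matrix. This is only an illustrative sketch (dense, row-major, no pivoting), not the sparse implementation used for the diagonal blocks and border.

    /* Doolittle (fan-in) LU factorization of an n x n row-major matrix,
     * performed in place: L (unit diagonal, not stored) overwrites the
     * strictly lower triangle and U overwrites the upper triangle. */
    void doolittle_lu(double *a, int n)
    {
        for (int i = 0; i < n; i++) {
            /* L entries of row i: columns k < i */
            for (int k = 0; k < i; k++) {
                double sum = a[i*n + k];
                for (int j = 0; j < k; j++)
                    sum -= a[i*n + j] * a[j*n + k];
                a[i*n + k] = sum / a[k*n + k];
            }
            /* U entries of row i: columns k >= i */
            for (int k = i; k < n; k++) {
                double sum = a[i*n + k];
                for (int j = 0; j < i; j++)
                    sum -= a[i*n + j] * a[j*n + k];
                a[i*n + k] = sum;
            }
        }
    }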

The algorithm section that updates the last diagonal block calculates a sparse matrix-matrix product as a collection of individual sparse vector-vector products between lower-border rows and upper-border columns. The resulting partial sums must be distributed to the processors holding the corresponding data in the last diagonal block. Separate sparse vector products are performed for each block of data on a processor, and only non-zero rows in the lower border and non-zero columns in the upper border are used, which significantly limits the amount of calculation in this phase. While developing an implementation with optimal performance, we discovered that any attempt to consolidate updates to a single value in the last diagonal block caused more overhead than simply sending multiple update values to the same processor: more work is required to sum the update data than to calculate the sparse vector products. For the same reason, no attempt is made to perform a parallel reduction of the partial sums from the borders.
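
The sketch below shows one such sparse vector-vector product between a non-zero lower-border row and a non-zero upper-border column, assuming both are stored with explicit, ascending indices; the SparseVec structure and sparse_dot name are illustrative rather than the actual data structures.

    /* Sparse dot product of two index-sorted sparse vectors.  The result is
     * the partial sum that must be sent to the processor owning the
     * corresponding element of the last diagonal block. */
    typedef struct {
        int    *idx;   /* explicit row/column identifiers of non-zeros */
        double *val;   /* non-zero values                              */
        int     nnz;   /* number of non-zeros                          */
    } SparseVec;

    double sparse_dot(const SparseVec *x, const SparseVec *y)
    {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < x->nnz && j < y->nnz) {      /* merge on matching indices */
            if      (x->idx[i] < y->idx[j]) i++;
            else if (x->idx[i] > y->idx[j]) j++;
            else    sum += x->val[i++] * y->val[j++];
        }
        return sum;
    }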

This block-diagonal-bordered LU factorization implementation solves the last block with a sparse, blocked kji-saxpy LU algorithm based on the dense kij-saxpy algorithm described in [61]. Here k is the index corresponding to the diagonal, i is the index corresponding to rows, and j is the index corresponding to columns. Our examination of power systems network matrices showed that after these matrices are partitioned into block-diagonal-bordered form, little additional parallelism remains in the last diagonal block: there is insufficient inherent parallelism to be exploited in a load-balanced manner. Consequently, a pipelined algorithm is used. The algorithm to factor the last diagonal block maintains the blocked format and pipelined nature of the dense kij-saxpy algorithm; however, changes were made to the algorithm in order to minimize calculations and to minimize communications overhead. The most significant of these changes, reordering the calculations within each rank-1 update, is discussed below.

While changing the order of calculations in each rank-1 update of the partially factored sub-matrix has little effect on the factorization algorithm itself, being equivalent to exchanging the order of two for loops, this seemingly small modification contributed significantly to the performance of the forward reduction and backward substitution steps. It greatly reduced the amount of communications during both forward reduction and backward substitution by allowing only calculated solution values to be broadcast, rather than also requiring the broadcast of the partial sums encountered when updating those values.
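
The dense fragments below make the loop exchange concrete: both orderings perform the identical rank-1 update of the trailing sub-matrix for pivot index k, and only the traversal order differs. They are an illustration of the two orderings, not the sparse blocked implementation.

    /* kij ordering: for pivot k, update the trailing sub-matrix row by row. */
    void rank1_update_kij(double *a, int n, int k)
    {
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
    }

    /* kji ordering: the same update, traversed column by column. */
    void rank1_update_kji(double *a, int n, int k)
    {
        for (int j = k + 1; j < n; j++)
            for (int i = k + 1; i < n; i++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
    }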

As would be expected with a sparse matrix, data is stored as sparse vectors with explicit row and column identifiers. To optimize performance for a kji algorithm, data is stored in sparse vectors corresponding to rows in the matrix. These sparse vectors of row data are stored as three separate data structures:

  1. values on the diagonal,
  2. values in the strictly lower triangular matrix,
  3. values in the strictly upper triangular matrix.
The use of three data structures greatly simplifies parallel rank-1 updates in this fan-out algorithm. In parallel LU factorization, only the data in either the strictly lower or the strictly upper triangular matrix must be broadcast to all processors when updating the sub-matrix. By storing the data for the two triangular matrices in separate structures, the data in a block can be accessed directly by the buffered communications software. With irregular sparse matrices, this storage technique eliminates the memory-to-memory copy step that would be required if each row were stored in a single sparse vector: because the data is irregular, no regular strides could be used to form the communications buffer if all of a row's data were stored contiguously.
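
A sketch of this per-row storage might look as follows; the field names are illustrative, but the separation of the diagonal, strictly lower, and strictly upper data follows the description above.

    /* Per-row storage for the last diagonal block: the diagonal entry and
     * the strictly lower and strictly upper non-zeros are kept separately,
     * so a block of L or U data can be handed directly to the buffered
     * communications software without a gather/copy step. */
    typedef struct {
        int     row;       /* explicit row identifier               */
        double  diag;      /* value on the diagonal                 */

        int     nnz_low;   /* strictly lower triangular non-zeros   */
        int    *col_low;   /* explicit column identifiers           */
        double *val_low;

        int     nnz_up;    /* strictly upper triangular non-zeros   */
        int    *col_up;
        double *val_up;
    } SparseRow;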

Parallel block-diagonal-bordered Choleski factorization algorithms are similar to the LU factorization algorithms presented in the appendix, with modifications to account for the symmetric matrices used in Choleski factorization. Choleski factorization requires only about half of the calculations of LU factorization, and block-diagonal-bordered Choleski factorization performs only half the number of updates to the last diagonal block. The parallel Choleski algorithm using the active message communications paradigm therefore sees reduced communications compared to LU factorization; however, there is no reduction in the number of messages required in a buffered-communications update of the last diagonal block. While the buffers would be shorter, each processor must still communicate with every other processor, and in most instances the message start-up latency dominates, not the per-word transport cost. There is also no reduction in the amount of communications when factoring the last block.

In other words, parallel Choleski factorization is a more difficult algorithm to implement efficiently than LU factorization, because there is a significant reduction in the amount of calculation without a corresponding reduction in communications overhead; the computation-to-communications ratio for Choleski factorization is roughly half that of LU factorization. Consequently, the results in section 7.1 will compare the performance of LU factorization and Choleski factorization to illustrate performance as a function of the number of floating-point operations relative to the communications overhead. To get a better understanding of this trend, a version of the parallel block-diagonal-bordered LU factorization algorithm has also been implemented for complex data. A complex multiply-subtract requires four floating-point multiplies and four additions/subtractions, compared to one multiply and one subtract for double-precision real data. With these three implementations, we can clearly illustrate the need for low-latency communications in algorithms that solve power systems network linear equations. These matrices are sufficiently sparse that increasing the number of floating-point operations to two or eight times that of Choleski factorization yields a marked increase in relative parallel speedup: communications overhead remains constant while only the number of floating-point operations increases. By considering the capabilities of the target parallel architecture, namely its computation-to-communications ratio or granularity, the communications capabilities required of a target parallel architecture can be identified.
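
The operation counts behind the factor-of-two and factor-of-eight comparisons can be seen in the following C99 fragment; the function names are illustrative.

    #include <complex.h>

    /* A double-precision multiply-subtract costs one multiply and one
     * subtract. */
    static inline double real_msub(double a, double l, double u)
    {
        return a - l * u;
    }

    /* Its complex counterpart expands to four multiplies and four
     * additions/subtractions: l*u needs 4 multiplies and 2 add/sub, and
     * subtracting the complex product from a adds 2 more subtracts. */
    static inline double complex complex_msub(double complex a,
                                              double complex l,
                                              double complex u)
    {
        return a - l * u;
    }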


