
Selecting Partitioned Matrices with Best Parallel Solver Performance

The primary factors that affect the performance of parallel direct sparse linear solvers are available parallelism, load balancing, and communication overhead. Our choice to order the power systems matrices into block-diagonal-bordered form provides a means to significantly simplify the task graph required to factor the matrix and to make all communications regular. We have shown in section 7.1.1 that the node-tearing algorithm can partition power systems network matrices into block-diagonal-bordered form and offer substantial parallelism in the diagonal blocks and borders.
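To illustrate concretely where this parallelism comes from, the following minimal sketch (in Python with NumPy, using small dense random blocks; it is not the sparse CM-5 implementation described in this work) solves a block-diagonal-bordered system by block elimination: each diagonal block and its border strips are processed independently, and only the Schur-complement updates to the last diagonal block require communication.

import numpy as np

rng = np.random.default_rng(0)

def dd_block(n):
    # random block made diagonally dominant so plain dense solves are well posed
    return rng.standard_normal((n, n)) + n * np.eye(n)

block_sizes = [32, 32, 32]   # analogue of the maximum partition size
m = 8                        # order of the last (coupling) diagonal block

A  = [dd_block(n) for n in block_sizes]                   # independent diagonal blocks
B  = [rng.standard_normal((n, m)) for n in block_sizes]   # right-border strips
C  = [rng.standard_normal((m, n)) for n in block_sizes]   # lower-border strips
Ac = dd_block(m)                                          # last diagonal block
b  = [rng.standard_normal(n) for n in block_sizes]        # block right-hand sides
bc = rng.standard_normal(m)

# Each diagonal block and its border strips can be handled by a separate
# processor with no communication; only the Schur-complement and right-hand-
# side updates below must be combined for the last diagonal block.
schur, rhs = Ac.copy(), bc.copy()
for Ai, Bi, Ci, bi in zip(A, B, C, b):
    schur -= Ci @ np.linalg.solve(Ai, Bi)
    rhs   -= Ci @ np.linalg.solve(Ai, bi)

xc = np.linalg.solve(schur, rhs)           # solve the coupling equations
x  = [np.linalg.solve(Ai, bi - Bi @ xc)    # independent back-substitutions
      for Ai, Bi, bi in zip(A, B, b)]

# Check the block solution against the assembled dense system.
n_tot = sum(block_sizes) + m
full = np.zeros((n_tot, n_tot))
off = 0
for Ai, Bi, Ci, n in zip(A, B, C, block_sizes):
    full[off:off+n, off:off+n] = Ai
    full[off:off+n, n_tot-m:] = Bi
    full[n_tot-m:, off:off+n] = Ci
    off += n
full[n_tot-m:, n_tot-m:] = Ac
assert np.allclose(np.concatenate(x + [xc]),
                   np.linalg.solve(full, np.concatenate(b + [bc])))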

The node-tearing algorithm has a single input parameter, the maximum partition size, which controls the size of the diagonal blocks as well as the size of the borders and the last diagonal block. To determine the partitioning with the best parallel direct block-diagonal-bordered sparse linear solver performance, we examined empirical data collected from algorithm benchmark trials on the Thinking Machines CM-5 multi-computer. The graphs presented in figure 7.4 illustrate the performance of LU factorization and of the combined forward and backward triangular solution steps for the Boeing-Harwell matrices BCSPWR09 and BCSPWR10. Each graph contains timing data for double precision LU factorization and for forward reduction/backward substitution. The graphs are plotted on a log-log scale and show that, for each power system network, a maximum of 32 nodes per partition yields the best overall factorization performance.
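Once the benchmark timings have been collected, the selection itself is straightforward. The following sketch uses hypothetical timing values (the real measurements appear in figure 7.4) to show one way to pick the maximum partition size that minimizes total runtime when each factorization is reused for k triangular solutions.

# Hypothetical illustration of the selection step (the timing values below are
# made up; the measured CM-5 data appear in figure 7.4).  Times are seconds
# for one factorization and for one forward/backward triangular solution,
# keyed by the maximum partition size.
factor_time = {16: 0.30, 32: 0.25, 64: 0.28}
solve_time  = {16: 0.060, 32: 0.045, 64: 0.050}

def best_partition_size(k):
    # total runtime when each factorization is reused for k triangular solutions
    return min(factor_time, key=lambda size: factor_time[size] + k * solve_time[size])

print(best_partition_size(1))    # best size when the factorization is used once
print(best_partition_size(50))   # best size when it is reused many times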

 
Figure 7.4: Parallel LU Factorization Timing Data --- Double Precision  

We are considering software to be embedded within a more extensive power systems application, so we must examine efficient parallel forward reduction and backward substitution algorithms in addition to parallel factorization algorithms. Because of the much smaller amount of calculation in the triangular solution phases, these algorithms are often ignored when parallel Choleski or LU factorization algorithms are presented in the literature. However, the graphs in figure 7.4 show that on a single processor the time to factor the matrix is only approximately an order of magnitude greater than the time required to perform forward reduction and backward substitution. This is a direct result of the extremely sparse nature of power system network matrices. For a dense matrix of order n, the number of calculations to factor the matrix is O(n^3), while the number of calculations in the triangular solutions is O(n^2). For dense matrices as large as these two matrices, there would be a far greater difference in wall-clock time between factorization and the triangular solutions, a difference that is not present here. As a result, we must also consider the performance of the triangular solution step, especially when a factored matrix is reused dishonestly, that is, applied repeatedly in multiple triangular solutions without refactorization. Moreover, the order-of-magnitude difference in performance erodes for large numbers of processors, because, as will be shown, the factorization algorithms exhibit better relative speedup than forward reduction and backward substitution.
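A rough calculation makes the contrast with the dense case explicit. The sketch below uses the standard dense operation counts, approximately (2/3)n^3 operations for LU factorization and 2n^2 for the pair of triangular solutions, together with an illustrative matrix order and illustrative timings; none of these numbers are measurements from this work.

# Standard dense operation counts (assumptions of this sketch):
#   LU factorization                ~ (2/3) * n**3 operations
#   forward + backward substitution ~ 2 * n**2     operations
n = 5_000                                # illustrative matrix order
ratio = ((2 / 3) * n**3) / (2 * n**2)    # = n / 3
print(f"dense factor/solve operation ratio: {ratio:,.0f}x")
# -> about 1,667x for a dense matrix, versus the roughly 10x single-processor
#    gap measured for these very sparse power systems matrices (figure 7.4).

# When one factorization is reused "dishonestly" for k triangular solutions,
# the triangular solutions dominate the total once k exceeds that gap.
t_factor, t_solve = 10.0, 1.0            # illustrative times with a 10x gap
for k in (1, 10, 100):
    total = t_factor + k * t_solve
    print(f"k = {k:3d}: triangular solutions are {k * t_solve / total:.0%} of total time")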

For the BCSPWR10 power systems network in figure 7.4, the performance of the forward reduction/backward substitution step must also be considered when selecting the optimum network partitioning. Factorization performance is nearly identical for 16 and 32 nodes per partition, although the performance of the triangular solution step is significantly better for 32 nodes per partition than for 16 nodes per partition.





