Many factors can affect the performance of parallel direct sparse linear solvers; the primary factors are available parallelism, load balancing, and communication overhead. Our choice to order the power systems matrices into block-diagonal-bordered form significantly limits the task graph required to factor the matrix and makes all communications regular. We showed in section 8.1 that the node-tearing algorithm can partition the power systems network matrices into block-diagonal-bordered form and offer substantial parallelism in the diagonal blocks and borders.
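Schematically, a block-diagonal-bordered matrix has the form shown below (a sketch only; the number of diagonal blocks $k$ and the block sizes are determined by the network topology and by the partitioning parameter discussed next):
\[
A =
\begin{bmatrix}
A_1    &        &     & B_1    \\
       & \ddots &     & \vdots \\
       &        & A_k & B_k    \\
C_1    & \cdots & C_k & A_{k+1}
\end{bmatrix},
\]
where the diagonal blocks $A_1, \ldots, A_k$ can be factored independently, and the border blocks $B_i$ and $C_i$ generate updates only to the last diagonal block $A_{k+1}$.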
There is one input parameter to the node-tearing algorithm, the maximum partition size, which, when varied, affects the sizes of the diagonal blocks, the borders, and the last diagonal block. We also presented tables 2 through 6 to illustrate that the amount of fill-in and the number of floating-point operations vary as functions of the maximum partition size. To determine the partitioning that yields the best parallel direct block-diagonal-bordered sparse linear solver performance, we examined empirical data collected from algorithm benchmark trials on the Thinking Machines CM-5 multi-computer. The graphs presented in figures 41 and 42 illustrate the performance of LU factorization and of the combined forward and backward triangular solution steps for the Boeing-Harwell matrices BCSPWR09 and BCSPWR10. Each graph contains timing data for double precision LU factorization and for forward reduction/backward substitution. These graphs are plotted on a log-log scale and show that, for each power system network, a maximum of 32 nodes per partition yields the best overall performance for factorization.
Figure 41: Parallel LU Factorization Timing Data --- BCSPWR09 --- Double Precision
Figure 42: Parallel LU Factorization Timing Data --- BCSPWR10 --- Double Precision
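The selection procedure itself is straightforward to automate. The following minimal C sketch illustrates the kind of parameter sweep involved, under the assumption of hypothetical routines tear_nodes, factor_bdb, and triangular_solve standing in for the actual ordering and CM-5 solver codes; the candidate partition sizes are illustrative, not the exact values used in the trials, and the sum of the two phase times is used here as a simple figure of merit.

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-ins for the node-tearing ordering and the
 * block-diagonal-bordered factorization/solution routines; the real
 * codes run on the CM-5 and are not reproduced here. */
static void tear_nodes(int max_partition_size) { (void)max_partition_size; }
static void factor_bdb(void)       { /* LU factorization placeholder */ }
static void triangular_solve(void) { /* forward/backward solve placeholder */ }

/* Wall-clock time for one phase of the solver. */
static double elapsed(void (*phase)(void))
{
    clock_t t0 = clock();
    phase();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    /* Illustrative candidate values for the single node-tearing input
     * parameter, the maximum partition size. */
    int sizes[] = { 8, 16, 32, 64, 96 };
    int best = sizes[0];
    double best_t = 1e30;

    for (int i = 0; i < 5; i++) {
        tear_nodes(sizes[i]);
        double t_factor = elapsed(factor_bdb);
        double t_solve  = elapsed(triangular_solve);
        printf("max partition %3d: factor %.4fs  solve %.4fs\n",
               sizes[i], t_factor, t_solve);
        if (t_factor + t_solve < best_t) {
            best_t = t_factor + t_solve;
            best = sizes[i];
        }
    }
    printf("best maximum partition size: %d\n", best);
    return 0;
}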
We are considering software that will be embedded within a more extensive
power systems application, so we must examine efficient parallel
forward reduction and backward substitution algorithms in addition to
parallel factorization algorithms. These figures show that the time
to factor the matrix is only approximately an order of magnitude
greater than the time required to perform forward reduction and
backward substitution on a single processor. This small ratio is significant
because it illustrates the sparsity of these power system matrices.
Due to the reduced number of calculations in the triangular solution
phases of solving a factored system of linear equations, these
algorithms are often ignored when parallel Choleski or LU
factorization algorithms are presented in the literature. For a dense
matrix of order $n$, the number of calculations to factor the matrix is
approximately $\frac{2}{3}n^3$, while the number of calculations to perform
the two triangular solutions is approximately $2n^2$.
For dense matrices as large as these two matrices, there would be a
significant difference in wall-clock time between factorization and
the triangular solutions, a difference that is not present here. As a
result, we must also consider the performance of the triangular
solution step, especially when there is dishonest (re)use of
a factored matrix for multiple triangular solutions.
Moreover, this order-of-magnitude difference in
performance erodes for large numbers of processors, because it will be
shown that there is better relative speedup for the factorization
algorithms than for forward reduction and backward substitution.
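To make this concrete, the standard dense operation counts give the ratio
\[
\frac{\frac{2}{3}n^3}{2n^2} = \frac{n}{3},
\]
so for a dense matrix with order $n$ in the thousands, factorization would require roughly three orders of magnitude more time than the triangular solutions; the single order of magnitude observed here again reflects the sparsity of these power system matrices.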
In figure 42, we must consider the performance of the forward reduction/backward substitution step when selecting the better of the two partitionings: a maximum of 16 or 32 nodes per partition. The factorization performance for 16 and 32 nodes per partition is nearly the same, although the performance of the triangular solution step is significantly better for 32 nodes per partition than for 16 nodes per partition.
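This trade-off can be summarized with a simple amortized-cost sketch: if $T_f$ denotes the factorization time, $T_s$ the time for one forward reduction/backward substitution, and $k$ the number of triangular solutions performed per factorization, then the total time is
\[
T(k) = T_f + k\,T_s ,
\]
and as $k$ grows, the partitioning with the faster triangular solution step dominates even when the factorization times are nearly the same. Here $T_f$, $T_s$, and $k$ are illustrative symbols, not quantities reported in the figures.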