The primary factors that affect the performance of parallel direct sparse linear solvers are available parallelism, load balancing, and communication overhead. Ordering the power systems matrices into block-diagonal-bordered form significantly limits the size of the task graph required to factor the matrix and makes all communications regular. We have shown in section 7.1.1 that the node-tearing algorithm can partition power systems network matrices into block-diagonal-bordered form while offering substantial parallelism in the diagonal blocks and borders.
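To make the block-diagonal-bordered form concrete, the following sketch shows the defining property the node-tearing ordering produces: partitioned nodes are grouped into independent diagonal blocks, and only the cut (coupling) nodes, ordered last, may connect different partitions. This is an illustrative check, not the node-tearing algorithm itself; the partition assignment, the small example graph, and all function names are assumptions for the example.

```python
# Sketch (not the node-tearing algorithm): given a hypothetical partition
# assignment for a sparse symmetric matrix/graph, build the ordering that
# yields block-diagonal-bordered form -- partition blocks first, coupling
# (cut) nodes last -- and verify that no edge couples two different
# partitions directly (such coupling is only allowed through cut nodes).

def bdb_permutation(partition_of, cut_nodes, n):
    """Ordering: nodes grouped by partition first, cut nodes last."""
    order = []
    for p in sorted(set(partition_of.values())):
        order += sorted(v for v, q in partition_of.items() if q == p)
    order += sorted(cut_nodes)
    assert len(order) == n
    return order

def is_block_diagonal_bordered(adj, partition_of, cut_nodes):
    """True if no edge joins two partitioned nodes in different
    partitions -- the property that keeps the diagonal blocks
    independent (and hence factorable in parallel)."""
    for u, nbrs in adj.items():
        for v in nbrs:
            if u in partition_of and v in partition_of:
                if partition_of[u] != partition_of[v]:
                    return False
    return True

# Tiny example: nodes 0-3 in two partitions, node 4 is the lone cut node.
adj = {0: {1, 4}, 1: {0}, 2: {3, 4}, 3: {2}, 4: {0, 2}}
partition_of = {0: 0, 1: 0, 2: 1, 3: 1}
cut_nodes = {4}
print(is_block_diagonal_bordered(adj, partition_of, cut_nodes))  # True
print(bdb_permutation(partition_of, cut_nodes, 5))  # [0, 1, 2, 3, 4]
```

The maximum partition size input to node tearing controls how many nodes each diagonal block may absorb, which in turn trades diagonal-block parallelism against the size of the border and last block.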
The node-tearing algorithm takes a single input parameter, the maximum partition size, which controls the size of the diagonal blocks and, consequently, the size of the borders and the last diagonal block. To determine the partitioning with the best parallel direct block-diagonal-bordered sparse linear solver performance, we examined empirical data collected from algorithm benchmark trials on the Thinking Machines CM-5 multicomputer. The graphs in figure 7.4 illustrate performance for LU factorization and for the combined forward and backward triangular solution steps for the Boeing-Harwell matrices BCSPWR09 and BCSPWR10. Each graph contains timing data for double-precision LU factorization and for forward reduction/backward substitution. The graphs are plotted on a log-log scale and show that, for each power system network, a maximum of 32 nodes per partition yields the best overall factorization performance.
Figure 7.4: Parallel LU Factorization Timing Data --- Double Precision
We are considering software to be embedded within a more extensive
power systems application, so we must examine efficient parallel
forward reduction and backward substitution algorithms in addition to
parallel factorization algorithms. Due to the reduced amount of
calculations in the triangular solution phases, these algorithms are
often ignored when parallel Choleski or LU factorization algorithms
are presented in the literature. However, graphs presented in
figure 7.4 show that the time to factor the matrix is
only approximately an order of magnitude greater than the time
required to perform forward reduction and backward substitution on a
single processor. This is a direct result of the extremely sparse
nature of power system network matrices. For a dense matrix of order
$n$, the number of calculations to factor the matrix is $O(n^3)$,
while the number of calculations to perform the triangular solutions
is only $O(n^2)$. For dense
matrices as large as these two matrices, there would be a significant
difference in wall-clock time between factorization and triangular
solutions, a difference that is not present here. As a result, we
must also consider the performance of the triangular solution step,
especially when there is dishonest reuse of a factored matrix, in
which the same factors serve repeatedly for multiple triangular
solutions. Moreover, this order-of-magnitude difference in
performance erodes for large numbers of processors, because it will be
shown that there is better relative speedup for the factorization
algorithms than for forward reduction and backward substitution.
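A quick back-of-the-envelope calculation illustrates the dense-matrix gap described above. The matrix order used here is a round illustrative value, not the exact dimension of the Boeing-Harwell matrices; the constants ($n^3/3$ multiply-adds for dense LU, $2n^2$ for the two triangular solves) are the standard dense operation counts.

```python
# Compare dense LU factorization cost (~n^3/3 operations) against the
# cost of forward reduction plus backward substitution (~2n^2).
# n is a hypothetical matrix order, chosen only for illustration.

n = 5000                         # hypothetical matrix order
factor_ops = n**3 // 3           # dense LU factorization, ~n^3/3
solve_ops = 2 * n**2             # forward reduction + backward substitution
ratio = factor_ops / solve_ops
print(f"dense factor/solve ratio for n={n}: {ratio:.0f}x")  # 833x
```

For a dense matrix of this order the factorization would cost hundreds of times more than the triangular solutions, whereas the measured gap for these very sparse power systems matrices is only about one order of magnitude, which is why the triangular solution phase cannot be ignored here.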
For the BCSPWR10 power systems network in figure 7.4, we must also consider the performance of the forward reduction/backward substitution step when selecting the optimum network partitioning. The performance of the factorization algorithm is nearly identical for the two candidate partitionings, but the performance of the triangular solution step is significantly better for 32 nodes per partition than for 16 nodes per partition.
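For reference, the two triangular solution steps discussed throughout this section can be sketched as follows. This is a plain dense, sequential version, purely to fix the definitions; the parallel block-diagonal-bordered versions operate on the same recurrences block by block. The small $L$, $U$, and right-hand side are illustrative assumptions.

```python
# Forward reduction and backward substitution for L U x = b, where
# L is unit lower triangular and U is upper triangular (dense,
# list-of-lists representation; sequential reference version only).

def forward_reduction(L, b):
    """Solve L y = b with L unit lower triangular."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    return y

def backward_substitution(U, y):
    """Solve U x = y with U upper triangular."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# Example 2x2 system: A = L U, solve A x = b in two triangular sweeps.
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[2.0, 1.0], [0.0, 1.5]]
b = [3.0, 4.0]
x = backward_substitution(U, forward_reduction(L, b))  # x ≈ [0.667, 1.667]
```

Because each solve sweep touches every nonzero of the factors exactly once, its parallelism is bounded by the same block structure as factorization but with far less work per block, which is consistent with the poorer relative speedup noted above.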