The primary factors affecting the performance of this parallel sparse Gauss-Seidel algorithm are available parallelism, load-balancing overhead, and communications overhead. Our choice to order the power systems matrices into block-diagonal-bordered form and to color the last diagonal block provides the means to control all three:
The single input parameter to the node-tearing algorithm, the maximum
partition size, when varied, affects the size of the diagonal blocks
and the size of the borders and last diagonal block.
The tables in the appendix illustrate that the percentage of floating
point operations in the diagonal blocks and borders is a function of
the maximum partition size. To determine the partitioning with the
best parallel block-diagonal-bordered sparse Gauss-Seidel performance,
we examined the empirical data collected from algorithm benchmark
trials on the Thinking Machines CM-5 multi-computer. Graphs presented
in figure 7.13 illustrate the performance of four
Gauss-Seidel iterations and one convergence check for the
Boeing-Harwell matrices: BCSPWR09 and BCSPWR10. For each power system
network, a maximum of 32 nodes per partition yields the best overall
performance.
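The parallel structure being timed above can be sketched in code. The following is a minimal serial Python sketch of one block-diagonal-bordered Gauss-Seidel sweep, using dense NumPy storage and a tiny two-block system for illustration; the actual algorithm operates on large sparse power systems matrices, and all matrix values here are illustrative:

```python
import numpy as np

def bdb_gs_sweep(A_blocks, B_borders, C_borders, D_last,
                 b_blocks, b_last, x_blocks, x_last):
    """One Gauss-Seidel sweep over a block-diagonal-bordered system.
    The loop over diagonal blocks carries no inter-block dependence, so on a
    multicomputer each diagonal block (with its border strip) can be swept by
    one processor; only the last diagonal block gathers contributions from
    every block and thus requires interprocessor communication."""
    for A, B, b, x in zip(A_blocks, B_borders, b_blocks, x_blocks):
        for i in range(len(b)):
            # off-diagonal terms within the block plus the border coupling
            s = A[i] @ x - A[i, i] * x[i] + B[i] @ x_last
            x[i] = (b[i] - s) / A[i, i]
    # last diagonal block: uses the freshly updated block unknowns
    for i in range(len(b_last)):
        s = D_last[i] @ x_last - D_last[i, i] * x_last[i]
        s += sum(C[i] @ x for C, x in zip(C_borders, x_blocks))
        x_last[i] = (b_last[i] - s) / D_last[i, i]

# a tiny diagonally dominant example with two 2x2 diagonal blocks
A_blocks = [np.array([[4., 1.], [1., 4.]]), np.array([[5., 1.], [1., 5.]])]
B_borders = [np.array([[1.], [0.]]), np.array([[0.], [1.]])]
C_borders = [np.array([[1., 0.]]), np.array([[0., 1.]])]
D_last = np.array([[6.]])
b_blocks = [np.ones(2), np.ones(2)]
b_last = np.ones(1)
x_blocks = [np.zeros(2), np.zeros(2)]
x_last = np.zeros(1)
for _ in range(50):
    bdb_gs_sweep(A_blocks, B_borders, C_borders, D_last,
                 b_blocks, b_last, x_blocks, x_last)
```

Because the blocks read only the previous sweep's last-block values, the sweep visits unknowns in the same order as sequential Gauss-Seidel on the reordered matrix, so it converges to the same fixed point.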
Figure 7.13: Parallel Gauss-Seidel Timing Data --- Double Precision
The graph for the Boeing-Harwell BCSPWR09 matrix in figure 7.13 presents families of curves for the empirical performance of the double precision Gauss-Seidel algorithm for six values of the maximum number of nodes per partition. Timing performance is reported in milliseconds. All partitionings have roughly the same performance on a single processor; however, performance becomes quite variable at 32 processors. At each number of processors, we are interested in the fastest time over all partitionings. Performance data are plotted on log-log scales, so with perfect speedup the run time would halve for every doubling of the number of processors and each curve would appear as a straight line. However, load imbalance is evident for some partitionings, whose curves show no performance improvement when moving from 16 to 32 processors. Nevertheless, there are orderings that yield scalable performance even at 32 processors.
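The straight-line reading of the log-log plots can be checked directly: under perfect speedup, run time is T1/p, so log T is linear in log p with slope -1. A short sketch (the single-processor time is a made-up value, not taken from the benchmark data):

```python
import math

T1 = 100.0                        # hypothetical single-processor time (ms)
procs = [1, 2, 4, 8, 16, 32]
ideal = [T1 / p for p in procs]   # perfect speedup: time halves per doubling
# on log-log axes, log T = log T1 - log p: a straight line of slope -1
slopes = [(math.log(ideal[k + 1]) - math.log(ideal[k])) /
          (math.log(procs[k + 1]) - math.log(procs[k]))
          for k in range(len(procs) - 1)]
```

A measured curve that flattens (slope near zero) between 16 and 32 processors therefore signals the load imbalance described above.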
For small numbers of processors, performance is best with large partitions; however, smaller partitions yield the best performance for large numbers of processors. The size of the last diagonal block, and hence the communications overhead associated with solving that portion of the matrix, is inversely related to the maximum partition size. The best performance for a given number of processors is therefore a trade-off between the amount of parallel work in the diagonal blocks and the border and the amount of communications in the last diagonal block. Consequently, the best performance for a power systems network depends on both the number of processors and the partition size.
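This trade-off can be illustrated with a toy cost model; all constants below are invented for illustration and are not fitted to the CM-5 measurements:

```python
def model_time(p, max_part, n_total=1000.0, coupling=2.0, comm_cost=0.7):
    """Toy cost model of one sweep (illustrative constants only).
    Smaller partitions mean more diagonal blocks over which to spread the
    parallel work, but each partition contributes ~`coupling` rows to the
    last diagonal block, so they also mean more communication there."""
    n_parts = n_total / max_part                   # number of diagonal blocks
    n_last = coupling * n_parts                    # size of the last diagonal block
    work = (n_total - n_last) / min(p, n_parts)    # speedup limited by block count
    comm = comm_cost * n_last                      # last-block communication cost
    return work + comm
```

In this model, a large maximum partition size (e.g. 128 nodes) wins at two processors, a small one (e.g. 16 nodes) wins at 32 processors, and the large-partition curve flattens once the processor count exceeds the number of blocks, mirroring the load-imbalance behavior seen in the benchmark curves.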
The graph for the Boeing-Harwell BCSPWR10 matrix in figure 7.13 presents families of curves for the empirical performance of the double precision Gauss-Seidel algorithm for nine values of the maximum number of nodes per partition. This matrix has both more rows/columns and a slightly higher average number of non-zeros per row than the BCSPWR09 matrix. As a result, more partitionings show only limited effects of workload imbalance at 32 processors.
The performance of the complex variate parallel sparse Gauss-Seidel algorithm is similar to that of the double precision algorithm for these two matrices.