
Selecting Partitioned Matrices with Best Parallel Solver Performance

The primary factors affecting the performance of this parallel sparse Gauss-Seidel algorithm are the available parallelism, the load-balancing overhead, and the communications overhead. Ordering the power systems matrices into block-diagonal-bordered form and multi-coloring the last diagonal block provides the means:

  1. to make all parallelism easy to visualize,
  2. to significantly limit the size of the task graph for an iteration on the matrix,
  3. to minimize the amount of communications for the low-latency implementation or at least make all communications regular for the buffered communications implementation.
We have shown in section 7.2.1 that the node-tearing algorithm can partition power systems network matrices into block-diagonal-bordered form, offering substantial parallelism in the diagonal blocks and borders, while the multi-coloring algorithm provides parallelism in the last diagonal block.
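The structure described above can be sketched in code. The following is a purely illustrative, sequential sketch of one Gauss-Seidel sweep over a block-diagonal-bordered matrix; the function name, the dense-matrix representation, and the index-array partitioning are assumptions for clarity, not the sparse implementation benchmarked in this chapter.

```python
import numpy as np

def bdb_gauss_seidel_sweep(A, b, x, block_rows, last_rows):
    """One Gauss-Seidel sweep exploiting block-diagonal-bordered structure.

    A          -- coefficient matrix (dense here for clarity only)
    b          -- right-hand side vector
    x          -- current iterate, updated in place
    block_rows -- list of index arrays, one per independent diagonal block
    last_rows  -- index array for the last (multi-colored) diagonal block
    """
    def sweep(rows):
        # Standard Gauss-Seidel row update using the most recent values of x.
        for i in rows:
            s = b[i] - A[i, :] @ x + A[i, i] * x[i]
            x[i] = s / A[i, i]
    # Phase 1: the diagonal blocks are mutually independent (their off-block
    # entries are zero), so each block's sweep could run on a separate
    # processor with no interprocessor communication.
    for rows in block_rows:
        sweep(rows)
    # Phase 2: the last diagonal block couples all blocks through the border;
    # multi-coloring would let rows of the same color update concurrently,
    # but this sketch processes them sequentially.
    sweep(last_rows)
    return x
```

In the parallel solver, phase 1 maps one or more diagonal blocks to each processor, and only phase 2 requires communication.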

The single input parameter to the node-tearing algorithm, the maximum partition size, determines the size of the diagonal blocks and the size of the borders and last diagonal block. Tables gif through gif (appendix gif) illustrate that the percentage of floating point operations in the diagonal blocks and borders is a function of the maximum partition size. To determine the partitioning with the best parallel block-diagonal-bordered sparse Gauss-Seidel performance, we examined empirical data collected from algorithm benchmark trials on the Thinking Machines CM-5 multi-computer. Graphs presented in figure 7.13 illustrate the performance of four Gauss-Seidel iterations and one convergence check for the Boeing-Harwell matrices BCSPWR09 and BCSPWR10. For each power system network, a maximum of 32 nodes per partition yields the best overall performance.
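Selecting the best partitioning from such benchmark data amounts to taking, at each processor count, the maximum-partition-size value with the lowest measured run time. The helper below is a hypothetical illustration of that selection; the function name and the nested-dictionary layout of the timing data are assumptions, and the numbers fed to it would come from benchmark trials like those plotted in figure 7.13.

```python
def best_partitioning(timings):
    """Pick the maximum partition size with the lowest measured run time
    at each processor count.

    timings -- dict mapping processor count -> dict mapping maximum
               partition size -> measured run time (e.g. milliseconds)
    Returns a dict mapping processor count -> best maximum partition size.
    """
    return {p: min(by_size, key=by_size.get)
            for p, by_size in timings.items()}
```

This makes explicit that the "best" partitioning is a function of the processor count, not a single global choice.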

 
Figure 7.13: Parallel Gauss-Seidel Timing Data --- Double Precision  

The graph for the Boeing-Harwell BCSPWR09 matrix in figure 7.13 presents families of curves for the empirical performance of the double precision Gauss-Seidel for six values of the maximum number of nodes per partition. Timing performance is reported in milliseconds. All partitionings have roughly the same performance on a single processor; however, performance varies considerably at 32 processors. We are interested in the fastest time achieved by any partitioning at each number of processors. Performance data is plotted on log-log scales, so with perfect speedup the run time would halve for every doubling of the number of processors and each curve would appear as a straight line. Load imbalance is evident for those partitionings whose curves show no performance improvement when going from 16 to 32 processors. Nevertheless, some orderings yield scalable performance even on 32 processors.
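The speedup and efficiency behind these curves can be computed directly from the measured times. The small helper below is an illustrative sketch (the function name is an assumption): on log-log axes, the perfect-speedup curve t1/p is a straight line of slope -1, and efficiency below one reflects load imbalance and communications overhead.

```python
def speedup_and_efficiency(t1, tp, p):
    """Speedup and parallel efficiency from measured run times.

    t1 -- single-processor run time
    tp -- run time on p processors
    p  -- number of processors
    Returns (speedup, efficiency); perfect speedup gives (p, 1.0).
    """
    speedup = t1 / tp
    return speedup, speedup / p
```

A partitioning whose time does not improve from 16 to 32 processors has efficiency roughly halved at 32, which is how load imbalance shows up in figure 7.13.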

For small numbers of processors, performance is best with large partitions; for large numbers of processors, smaller partitions perform best. The size of the last diagonal block, and hence the communications overhead associated with solving that portion of the matrix, is inversely related to the maximum partition size. The best performance at a given number of processors is a trade-off between the amount of parallel work in the diagonal blocks and the border and the amount of communications in the last diagonal block. Consequently, the best partitioning for a power systems network depends on both the number of processors and the partition size.

The graph for the Boeing-Harwell BCSPWR10 matrix in figure 7.13 presents families of curves for the empirical performance of the double precision Gauss-Seidel for nine values of the maximum number of nodes per partition. This matrix has both more rows/columns and a slightly higher average number of non-zeros per row than the BCSPWR09 matrix. As a result, more of its partitionings show only limited effects of workload imbalance on 32 processors.

The performance of the complex variate parallel sparse Gauss-Seidel algorithm is similar to that of the double precision algorithm for these two matrices.






David P. Koester
Sun Oct 22 17:27:14 EDT 1995