
Timing Performance Comparisons

We present timing comparisons in figure 7.14 for two sample power system networks, BCSPWR09 and BCSPWR10, with curves for both the double precision and complex implementations. These graphs illustrate the sensitivity of the parallel sparse power systems network Gauss-Seidel solvers to load-imbalance overhead on 32 processors, while also illustrating the relative insensitivity of the low-latency parallel Gauss-Seidel implementations to communications overhead.

 
Figure 7.14: Best Empirical Performance for Parallel Gauss-Seidel Implementations  

For 16 and 32 processors, smaller partitions are required to minimize load-imbalance overhead. The empirical data for both the complex and double precision implementations on each graph include the maximum number of nodes per partition at each plotted point. The general trend in the performance data is toward significantly smaller partitions for 16 and 32 processors, trading small increases in communications overhead for possibly significant decreases in load-imbalance overhead. This trade-off yields significant gains for the BCSPWR09, BCSPWR10, and NiMo-OPS power systems networks, although load-imbalance overhead dominates the results for 32 processors with the EPRI6K and NiMo-PLANS power systems network planning matrices.
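To make this trade-off concrete, the following sketch is a minimal illustration, not code from the thesis: the load_imbalance helper, the processor count, and the per-processor node totals are all hypothetical. It measures load-imbalance overhead as the gap between the busiest processor and the mean, showing why distributing many small partitions tends to even out per-processor work compared with a few large ones.

/*
 * Illustrative sketch (not from the thesis code): estimate load-imbalance
 * overhead from per-processor work totals under two hypothetical
 * partitionings of the diagonal blocks.
 */
#include <stdio.h>

/* Load imbalance expressed as (max work - mean work) / max work,
   i.e., the fraction of time the busiest processor works while
   others may sit idle at a synchronization point. */
static double load_imbalance(const int work[], int nprocs)
{
    long total = 0;
    int max = work[0];
    for (int p = 0; p < nprocs; p++) {
        total += work[p];
        if (work[p] > max)
            max = work[p];
    }
    double mean = (double)total / nprocs;
    return (max - mean) / max;
}

int main(void)
{
    /* Hypothetical nodes assigned to each of 8 processors: a few large
       partitions assign work unevenly, while many small partitions can
       be distributed so that per-processor totals are nearly uniform. */
    int coarse[8] = { 410, 395, 388, 402, 510, 377, 391, 405 };
    int fine[8]   = { 412, 408, 405, 410, 415, 403, 409, 411 };

    printf("coarse partitions: imbalance = %.1f%%\n",
           100.0 * load_imbalance(coarse, 8));
    printf("fine partitions:   imbalance = %.1f%%\n",
           100.0 * load_imbalance(fine, 8));
    return 0;
}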

The double precision and complex Gauss-Seidel implementations perform different amounts of floating point computation, with a relative workload of approximately 1:4, yet they require equal amounts of communications. Additional floating point operations can do little to mitigate load imbalance; however, they improve the computation-to-communications ratio, or granularity, within the algorithm, and as a result the additional computations can have a positive effect on implementation performance. The graphs in figure 7.14 show some improvement in performance due to this improved granularity, although the improvement is minimal. The difference can be seen by examining the relative slopes of the curve splines between 16 and 32 processors: the spline for the complex implementation always has a steeper negative slope, but the slope differences remain small. We can therefore conclude that the low-latency parallel block-diagonal-bordered Gauss-Seidel implementation is not as sensitive to granularity as the parallel block-diagonal-bordered direct solvers, and communications overhead does not significantly detract from nearly perfect parallel speedups.
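The granularity argument can be illustrated with a simple cost model. The sketch below is not the author's implementation; the efficiency function and every timing constant in it are assumptions chosen only to show that roughly quadrupling the per-node computation amortizes a fixed communications cost while leaving the load-imbalance fraction unchanged.

/*
 * Illustrative sketch (not from the thesis): a toy granularity model.
 * Extra floating point work per node (complex vs. double precision,
 * roughly 4x) amortizes the fixed communications cost, but idle time
 * from load imbalance scales with the work, so that fraction is not
 * reduced.  All constants are hypothetical.
 */
#include <stdio.h>

/* Estimated parallel efficiency for one iteration, given computation
   time, communications time, and idle time from load imbalance. */
static double efficiency(double t_comp, double t_comm, double t_imbalance)
{
    return t_comp / (t_comp + t_comm + t_imbalance);
}

int main(void)
{
    double t_comm = 1.0;          /* same message volume for either data type */
    double t_real = 4.0;          /* hypothetical double precision work        */
    double t_cplx = 4.0 * t_real; /* ~4x the floating point work for complex   */
    double imb    = 0.15;         /* idle time as a fraction of the work       */

    printf("double precision efficiency: %.2f\n",
           efficiency(t_real, t_comm, imb * t_real));
    printf("complex efficiency:          %.2f\n",
           efficiency(t_cplx, t_comm, imb * t_cplx));
    return 0;
}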





