Parallel Sparse Gauss-Seidel Performance

We have collected empirical performance data for the parallel block-diagonal-bordered sparse Gauss-Seidel method running on the Thinking Machines CM-5 multicomputer for two implementations:

  1. parallel sparse Gauss-Seidel for double precision variables,
  2. parallel sparse Gauss-Seidel for complex variables,
for each of two communications paradigms:
  1. low-latency communications,
  2. buffered communications,
for five separate power systems networks:
  1. BCSPWR09,
  2. BCSPWR10,
  3. EPRI6K,
  4. NiMo-OPS,
  5. NiMo-PLANS,
for 1, 2, 4, 8, 16, and 32 processors, and for as many as nine matrix partitions, with the maximum number of graph nodes per partition varying from 16 to 512.

As we examine the empirical results, we first describe the selection process used to identify the matrix partitioning that yielded the best parallel performance for each number of processors. This reduces the amount of data we must consider when examining the performance of the implementations. The two iterative solver implementations entail different amounts of floating point calculation: the relative workload on a single processor is approximately 1:4 for the double precision versus complex versions of the algorithm, because a single complex multiply/add operation requires four separate multiplications and four addition/subtraction operations. While the two versions differ in the amount of floating point calculation, they perform equal amounts of communications, so the granularity of the algorithm increases in the same 1:4 proportion.
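
To make this operation count concrete, the following C sketch (a hypothetical illustration, not code from the solver itself) performs the complex multiply/accumulate y += a*x on real/imaginary pairs; counting the operations shows the four multiplications and four additions/subtractions noted above:

    /* Complex multiply/accumulate y += a*x, with each complex value
       stored as a {real, imaginary} pair of doubles. */
    void cmac(double y[2], const double a[2], const double x[2])
    {
        y[0] += a[0]*x[0] - a[1]*x[1];  /* real part: 2 mults, 2 adds/subs */
        y[1] += a[0]*x[1] + a[1]*x[0];  /* imag part: 2 mults, 2 adds      */
    }

The corresponding double precision update costs one multiplication and one addition, which is the source of the approximate 1:4 workload ratio.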

We will present timing comparisons that illustrate the effect of partition size on this parallel sparse Gauss-Seidel algorithm. We next examine relative speedup for the two solver implementations, and then assess the performance of the load-balancing step by examining the timing data for each component of the algorithm. Lastly, we discuss the performance improvements achieved with low-latency communications and the simplifications to the algorithm that this paradigm made possible. The low-latency communications paradigm permits an important implementation difference: the active message-based implementation requires only the minimal necessary communications when calculating values for the unknowns in the last diagonal block, which greatly improves its performance relative to the buffered communications implementation.
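
For reference, the update performed for each unknown in a Gauss-Seidel sweep is sketched below in C for a dense system Ax = b (a simplified, hypothetical illustration; the solver measured here operates on sparse block-diagonal-bordered matrices and distributes these updates across processors):

    /* One Gauss-Seidel sweep over the dense n-by-n system A*x = b.
       Each unknown x[i] is updated in place, immediately reusing the
       most recently computed values of the other unknowns. */
    void gauss_seidel_sweep(int n, const double *A, const double *b, double *x)
    {
        for (int i = 0; i < n; i++) {
            double sum = b[i];
            for (int j = 0; j < n; j++)
                if (j != i)
                    sum -= A[i*n + j] * x[j];
            x[i] = sum / A[i*n + i];
        }
    }

Because each update consumes the latest values of the other unknowns, computing the unknowns in the last diagonal block requires values produced on other processors, which is why that step is especially sensitive to the choice of communications paradigm.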




