
Empirical Results --- Parallel Direct Sparse Solver Performance

We have collected empirical data for parallel block-diagonal-bordered sparse direct methods on the Thinking Machines CM-5 multi-computer for three solver implementations ---

  1. Choleski factorization and forward reduction/backward substitution for double precision variables
  2. LU factorization and forward reduction/backward substitution for double precision variables
  3. LU factorization and forward reduction/backward substitution for complex variables
for each of two communications paradigms ---
  1. active message based communications
  2. asynchronous, non-blocking, buffered communications
for five separate power systems networks ---
  1. BCSPWR09
  2. BCSPWR10
  3. EPRI6K
  4. NiMo-OPS
  5. NiMo-PLANS
for 1, 2, 4, 8, 16, and 32 processors --- and for four matrix partitionings --- with maximums of 16, 32, 64, and 96 graph nodes per partition, respectively.
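Taken together, these parameters define the full space of test cases. The following is a minimal, purely illustrative sketch (not part of the original study; the variable names are placeholders) that tallies the configurations as the Cartesian product of the parameter lists above:

    # Illustrative sketch only: tally the experimental configurations
    # described above (parameter names are placeholders).
    from itertools import product

    solvers = ["Choleski (double)", "LU (double)", "LU (complex)"]
    communications = ["active messages", "asynchronous buffered"]
    networks = ["BCSPWR09", "BCSPWR10", "EPRI6K", "NiMo-OPS", "NiMo-PLANS"]
    processors = [1, 2, 4, 8, 16, 32]
    max_partition_nodes = [16, 32, 64, 96]

    configurations = list(product(solvers, communications, networks,
                                  processors, max_partition_nodes))
    print(len(configurations))  # 3 x 2 x 5 x 6 x 4 = 720 test cases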

As we examine the empirical results, we first describe the selection process used to identify the matrix partitioning that yielded the best parallel performance. This reduces the amount of data we must consider when examining the performance of the implementations. The three solver implementations perform increasing amounts of floating point calculation: the relative single-processor workloads of double precision Choleski factorization, double precision LU factorization, and complex LU factorization are approximately 1:2:6, because Choleski algorithms require only about half the floating point operations of LU algorithms, and a complex multiplication expands into four double precision multiplications and two addition/subtraction operations, compared with the single multiplication and addition of a double precision multiply/add. While these algorithms perform differing amounts of calculation, they perform equal amounts of communication. We will present timing comparisons that illustrate how sensitive parallel sparse direct solvers for power systems networks are to communications overhead. This sensitivity is not entirely unexpected, given the extreme sparsity of power systems matrices. We next examine speedup for the three solver implementations, considering both factorization and the combined forward reduction and backward substitution. We then examine the performance of the load-balancing step by examining the timing data for each component of the algorithm. Lastly, we discuss the performance improvements achieved with active message communications and the corresponding simplifications to the algorithm that this communications paradigm made possible.
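The factor of three between double precision and complex LU workloads follows from expanding a complex multiplication into real arithmetic, shown here for concreteness:

    (a + bi)(c + di) = (ac - bd) + (ad + bc)i

which costs four double precision multiplications and two additions/subtractions, versus the one multiplication and one addition of a double precision multiply/add. Combined with the roughly two-to-one ratio of LU to Choleski operation counts, this yields the approximate 1:2:6 workloads noted above.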




