
Parallel Direct Sparse Solver Performance

We have collected empirical data for parallel block-diagonal-bordered sparse direct methods on the Thinking Machines CM-5 multi-computer for three solver implementations ---

  1. Choleski factorization and forward reduction/backward substitution for double precision variables,
  2. LU factorization and forward reduction/backward substitution for double precision variables,
  3. LU factorization and forward reduction/backward substitution for complex variables,
for each of two communications paradigms ---
  1. low-latency communications,
  2. buffered communications,
for five separate power systems networks ---
  1. BCSPWR09,
  2. BCSPWR10,
  3. EPRI6K,
  4. NiMo-OPS,
  5. NiMo-PLANS,
for 1, 2, 4, 8, 16, and 32 processors, and for four matrix partitions with a maximum of 16, 32, 64, and 96 graph nodes per partition.
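
Assuming the full cross product of the configurations listed above was run (the text does not state this explicitly), the complete test matrix can be enumerated with the short C sketch below; the names and loop bounds simply restate the list above and the sketch is illustrative only, not part of the original CM-5 code.

#include <stdio.h>

int main(void)
{
    const char *solvers[]  = { "Choleski (double)", "LU (double)", "LU (complex)" };
    const char *comms[]    = { "low-latency", "buffered" };
    const char *networks[] = { "BCSPWR09", "BCSPWR10", "EPRI6K", "NiMo-OPS", "NiMo-PLANS" };
    const int procs[]      = { 1, 2, 4, 8, 16, 32 };
    const int max_nodes[]  = { 16, 32, 64, 96 };   /* max graph nodes per partition */
    int s, c, n, p, m, runs = 0;

    for (s = 0; s < 3; s++)
      for (c = 0; c < 2; c++)
        for (n = 0; n < 5; n++)
          for (p = 0; p < 6; p++)
            for (m = 0; m < 4; m++) {
                printf("%-18s  %-11s  %-10s  %2d procs  <=%2d nodes/partition\n",
                       solvers[s], comms[c], networks[n], procs[p], max_nodes[m]);
                runs++;
            }

    printf("total configurations: %d\n", runs);
    return 0;
}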

For the three solver implementations, the amount of floating point calculation increases from double precision Choleski factorization, to double precision LU factorization, to complex LU factorization, with a relative workload of approximately 1:2:8. Choleski algorithms require only approximately one half the number of floating point operations of LU algorithms, and each complex multiply/add operation requires four real multiplications and four real addition/subtraction operations. While the amounts of calculation in these algorithms differ, the amounts of communications are nearly equal, so the computational granularity of the algorithms increases in the same 1:2:8 proportion. The empirical timing data presented below illustrate just how sensitive the parallel sparse direct solvers for power systems networks are to communications overhead. This sensitivity is not entirely unexpected, given the extremely sparse nature of power systems matrices.
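
To make the operation counts behind this ratio concrete, the following C sketch contrasts a single real multiply/add with the equivalent complex update; the struct, function, and variable names are illustrative only and are not taken from the original code.

#include <stdio.h>

typedef struct { double re, im; } dcomplex;

/* real update: a -= l * u   (1 multiplication, 1 subtraction) */
static double real_update(double a, double l, double u)
{
    return a - l * u;
}

/* complex update: a -= l * u   (4 multiplications, 4 additions/subtractions) */
static dcomplex complex_update(dcomplex a, dcomplex l, dcomplex u)
{
    dcomplex r;
    r.re = a.re - (l.re * u.re - l.im * u.im);
    r.im = a.im - (l.re * u.im + l.im * u.re);
    return r;
}

int main(void)
{
    dcomplex ca = { 3.0, 1.0 }, cl = { 2.0, -1.0 }, cu = { 0.5, 0.25 };
    dcomplex cr = complex_update(ca, cl, cu);
    printf("real:    %g\n", real_update(3.0, 2.0, 0.5));
    printf("complex: %g %+gi\n", cr.re, cr.im);
    return 0;
}

Combining the factor of four per complex operation with the factor of two between Choleski and LU gives the 1:2:8 workload ratio cited above.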

Communications in block-diagonal-bordered Choleski or LU factorization occur in two locations: updating the last diagonal block using data in the borders, and factoring the last diagonal block. Because LU factorization requires the update of both L and U versus only L, there are twice as many calculations and twice as many values to distribute when updating the last diagonal block for LU factorization versus Choleski factorization. Meanwhile, there are equal amounts of communications for LU and Choleski factorization when factoring the last diagonal block. The parallel block-diagonal-bordered Choleski algorithm requires that data in L be broadcast to all processors in the pipelined algorithm that perform the rank 1 update of the sub-matrix. However, for the last diagonal block in the parallel block-diagonal-bordered LU factorization algorithm, only data in U must be broadcast during the parallel rank 1 update. For this research, we are assuming that the matrices are position symmetric, so L and U have equal numbers of non-zero values.
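
A minimal sketch of this communications pattern for factoring the last diagonal block appears below: a row-wrapped dense LU in which only the pivot row of U is broadcast at each elimination step, while the multipliers that form the column of L are computed locally. This is an illustration, not the author's CM-5 implementation; MPI_Bcast stands in for the CM-5 low-latency or buffered communications primitives, and the test matrix, its order, and the row-wrapped distribution are assumptions made for the example.

#include <mpi.h>
#include <stdio.h>

#define N 8                       /* order of the (dense) last diagonal block */

int main(int argc, char **argv)
{
    int rank, nproc, i, j, k;
    double a[N][N], urow[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Row-wrapped distribution: this process owns row i when i % nproc == rank.
       Fill the owned rows with an arbitrary diagonally dominant test matrix.   */
    for (i = 0; i < N; i++)
        if (i % nproc == rank)
            for (j = 0; j < N; j++)
                a[i][j] = (i == j) ? N : 1.0 / (1.0 + i + j);

    for (k = 0; k < N; k++) {
        int owner = k % nproc;
        if (rank == owner)                       /* pivot row k is row k of U  */
            for (j = k; j < N; j++)
                urow[j] = a[k][j];

        /* Only the row of U is broadcast; each process forms its own
           multipliers l(i,k) locally, so no column of L is communicated.       */
        MPI_Bcast(&urow[k], N - k, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        for (i = k + 1; i < N; i++)
            if (i % nproc == rank) {
                double lik = a[i][k] / urow[k];  /* multiplier: entry of L     */
                a[i][k] = lik;
                for (j = k + 1; j < N; j++)      /* rank-1 update of the rest  */
                    a[i][j] -= lik * urow[j];
            }
    }

    if (rank == 0)
        printf("last diagonal block factored (n = %d)\n", N);

    MPI_Finalize();
    return 0;
}

A column-wrapped Choleski version of the same loop would instead broadcast the column of L at each step, which, for position symmetric matrices, carries the same number of non-zero values as the row of U.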

As we examine the empirical results, we first describe the selection process to identify the matrix partitioning with the best parallel empirical performance. This reduces the amount of data we must consider when examining the performance of the implementations.







