We have collected empirical data for parallel block-diagonal-bordered sparse direct methods on the Thinking Machines CM-5 multi-computer for three solver implementations --- double precision Choleski factorization, double precision LU factorization, and complex LU factorization.
For these three implementations, the amount of floating point calculation increases from double precision Choleski factorization to double precision LU factorization to complex LU factorization, with a relative workload of approximately 1:2:8. Choleski algorithms require only approximately one half the number of floating point operations of LU algorithms, and a single complex multiply/add operation requires four separate multiplications and four addition/subtraction operations. While these algorithms perform differing amounts of calculation, they require nearly equal amounts of communication, so the granularity of the algorithms increases in the proportion 1:2:8. The empirical timing data presented below illustrate just how sensitive parallel sparse direct solvers for power systems networks are to communications overhead. This sensitivity is not totally unexpected, given the extremely sparse nature of power systems matrices.
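The following minimal sketch (not taken from the solver code) illustrates the operation count behind this ratio: a complex multiply/add expanded into real arithmetic requires four separate multiplications and four addition/subtraction operations, roughly four times the work of the corresponding double precision multiply/add.

\begin{verbatim}
/* Sketch only: complex multiply/add z += x*y in real arithmetic,
 * showing the 4 multiplications and 4 additions/subtractions that
 * underlie the ~1:2:8 relative workload quoted above.              */
typedef struct { double re, im; } dcomplex;

static void complex_multiply_add(dcomplex *z, dcomplex x, dcomplex y)
{
    z->re += x.re * y.re - x.im * y.im;  /* 2 multiplies, 1 subtract, 1 add */
    z->im += x.re * y.im + x.im * y.re;  /* 2 multiplies, 2 adds */
}
\end{verbatim}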
Communications in block-diagonal-bordered Choleski or LU factorization occur in two locations --- updating the last diagonal block using data in the borders and factoring the last diagonal block. Because LU factorization requires that the last diagonal block be updated with values from both the lower and upper borders, versus only the lower border for Choleski factorization, there are twice as many calculations and twice as many values to distribute when updating the last diagonal block for LU factorization versus Choleski factorization. Meanwhile, there are equal amounts of communications for LU and Choleski factorization when factoring the last diagonal block. The parallel block-diagonal-bordered Choleski algorithm requires that the newly computed column of $L$ be broadcast to all processors in the pipelined algorithm that perform the rank 1 update of the sub-matrix. However, for the last diagonal block in the parallel block-diagonal-bordered LU factorization algorithm, only the corresponding column of multipliers in $L$ must be broadcast during the parallel rank 1 update. For this research, we are assuming that the matrices are position symmetric, so the lower and upper borders have equal numbers of non-zero values.
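The sketch below (not the authors' CM-5 implementation) illustrates the pipelined factorization of the last diagonal block under the assumption that the block is treated as dense with its columns distributed cyclically over the processors; MPI is used purely for illustration in place of the CM-5 message-passing primitives. In this LU variant, the column of multipliers is the only data broadcast at each elimination step, while the pivot-row values needed by the rank 1 update are stored with the local columns; a Choleski variant would broadcast the newly computed column of $L$, which supplies both multiplicands of the rank 1 update, so the per-step communication volume is comparable.

\begin{verbatim}
#include <stdlib.h>
#include <mpi.h>

/* Sketch only: dense LU of the last diagonal block, no pivoting.
 * a is column-major, a[j*n + i] = a(i,j); column j is local when
 * j % nproc == me.                                                 */
void factor_last_block_lu(double *a, int n, int me, int nproc,
                          MPI_Comm comm)
{
    double *mult = (double *) malloc((size_t) n * sizeof(double));

    for (int k = 0; k < n - 1; k++) {
        int owner = k % nproc;

        if (me == owner) {                  /* multipliers l(i,k)   */
            double pivot = a[(size_t) k * n + k];
            for (int i = k + 1; i < n; i++)
                mult[i - k - 1] = a[(size_t) k * n + i] / pivot;
        }

        /* one vector of length n-k-1 is broadcast per step         */
        MPI_Bcast(mult, n - k - 1, MPI_DOUBLE, owner, comm);

        if (me == owner)                    /* store L below diag   */
            for (int i = k + 1; i < n; i++)
                a[(size_t) k * n + i] = mult[i - k - 1];

        for (int j = k + 1; j < n; j++) {   /* rank 1 update of own columns */
            if (j % nproc != me)
                continue;
            double ukj = a[(size_t) j * n + k]; /* pivot-row entry, local */
            for (int i = k + 1; i < n; i++)
                a[(size_t) j * n + i] -= mult[i - k - 1] * ukj;
        }
    }
    free(mult);
}
\end{verbatim}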
Before examining the empirical results in detail, we first describe the process used to select the matrix partitioning with the best parallel performance. This selection reduces the amount of data we must consider when examining the performance of the implementations.