We have collected empirical performance data for the parallel block-diagonal-bordered sparse Gauss-Seidel method running on the Thinking Machines CM-5 multi-computer for two implementations --- a double precision version and a complex variate version of the algorithm.
As we examine the empirical results, we first describe the selection process used to identify the matrix partitioning that yielded the best parallel performance for each number of processors; this reduces the amount of data we must consider when examining the performance of the implementations. The two iterative solver implementations perform different amounts of floating point calculation: the relative workload on a single processor is approximately 1:4 for the double precision versus the complex variate version of the algorithm, because a single complex multiply/add operation requires four separate real multiplications and four real addition/subtraction operations. While the amounts of floating point calculation differ, the amounts of communication are equal, so the granularity of the algorithm increases in the same 1:4 proportion.
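The 4-multiply, 4-add/subtract count for a complex multiply/add can be verified by expanding z = a*b + c over real components; the sketch below (illustrative only, not taken from the solver implementations) makes the operation count explicit:

```python
def complex_madd(ar, ai, br, bi, cr, ci):
    """Complex multiply/add z = a*b + c using only real arithmetic.

    The multiply contributes 4 real multiplications and 2 real
    additions/subtractions; accumulating c adds 2 more additions,
    giving the 4-multiply / 4-add-subtract count cited in the text.
    """
    zr = ar * br - ai * bi + cr   # 2 multiplies, 1 subtract, 1 add
    zi = ar * bi + ai * br + ci   # 2 multiplies, 2 adds
    return zr, zi

# Agrees with Python's built-in complex arithmetic:
# complex_madd(1, 2, 3, 4, 5, 6) == (1+2j)*(3+4j) + (5+6j) == 16j
```

Since a double precision multiply/add is one multiplication and one addition, the per-element floating point workload ratio between the two versions is roughly 1:4, as stated above.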
We will present timing comparisons that illustrate the effect that
partition size has on this parallel sparse Gauss-Seidel algorithm. We
next examine relative speedup for the two solver implementations, and
then assess the performance of the load-balancing step by examining
the timing data for each component of the algorithm. Lastly, we
discuss the performance improvements achieved by using low-latency
communications and the corresponding simplifications to the algorithm
that this communications paradigm made possible. In particular, the
low-latency communications paradigm permits an important
implementation difference: the active message-based implementation
performs only the minimal necessary communications when calculating
values in the last diagonal block, which greatly improves its
performance relative to the buffered communications implementation.
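The contrast between the two paradigms can be sketched abstractly. The toy classes below are illustrative only --- the names and the single-queue model are assumptions for exposition, not the CM-5's actual buffered (CMMD) or active message (CMAML) interfaces:

```python
class BufferedReceiver:
    """Buffered paradigm: each arriving message is copied into a
    queue, and the receiver must later poll, unpack, and dispatch
    every queued message before the values become usable."""
    def __init__(self):
        self.queue = []
        self.x = {}   # stands in for values of the last diagonal block

    def send(self, index, value):
        self.queue.append((index, value))   # extra copy, deferred work

    def drain(self):
        while self.queue:
            index, value = self.queue.pop(0)
            self.x[index] = value           # dispatch happens only here


class ActiveMessageReceiver:
    """Active-message paradigm: each arrival invokes a short handler
    immediately, so no intermediate buffering or polling is needed."""
    def __init__(self):
        self.x = {}

    def send(self, index, value):
        self.handler(index, value)          # handler runs on arrival

    def handler(self, index, value):
        self.x[index] = value
```

In the buffered model an update to the last diagonal block is visible only after the explicit drain step; in the active-message model each update lands as soon as it arrives, which is why the low-latency implementation can get by with only the minimal necessary communications there.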