We have developed two versions of this parallel
block-diagonal-bordered sparse linear solver: one version uses a
low-latency, active-message-based communications paradigm, and the
other uses a buffered communications paradigm. These communications
paradigms significantly modified the respective algorithms, as seen in
an earlier chapter. For all power systems networks examined, the
largest relative workload when factoring or forward reducing the
matrix is to update the last diagonal block. Increases in
communications overhead in this portion of the algorithm could
significantly affect parallel algorithm performance. In
figure 7.9, we present graphs of the speedups obtained
using low-latency communications versus buffered communications on the
CM-5 for factoring and reducing the matrices. These graphs show that
overall speedups can be as great as 1.6 when factoring the operations
matrices with the low-latency algorithm, but speedups are smaller for
the larger planning networks.
Speedups are also greatest for the smaller operations matrices when
updating the last diagonal block during forward reduction.
Figure 7.9: Speedup --- Low-Latency versus Buffered Communications
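The text presents no source code for the two paradigms, so the following is a minimal sketch of the last-diagonal-block update under each one, using present-day MPI in C as a stand-in for the CM-5 communications libraries the implementations actually target. The function names and the accumulate-at-the-owner scheme are our illustration, not the authors' implementation: blocking MPI_Send/MPI_Recv pairs model the buffered paradigm, while one-sided MPI_Accumulate calls model the low-latency, active-message style in which each partial update is applied directly at its destination without a matching receive.

/* Sketch only: each processor holds a dense n-by-n partial update to the
 * shared last diagonal block, and processor `owner` must accumulate all
 * of them.  MPI stands in here for the CM-5 communications layer. */
#include <mpi.h>
#include <stdlib.h>

/* Buffered paradigm: updates travel as ordinary buffered messages, and
 * the owner adds them in only after each message has fully arrived. */
void update_last_block_buffered(double *partial, int n, int owner,
                                int rank, MPI_Comm comm)
{
    if (rank != owner) {
        MPI_Send(partial, n * n, MPI_DOUBLE, owner, 0, comm);
    } else {
        int p;
        MPI_Comm_size(comm, &p);
        double *tmp = malloc((size_t)n * n * sizeof *tmp);
        for (int src = 0; src < p - 1; src++) {      /* p-1 senders */
            MPI_Recv(tmp, n * n, MPI_DOUBLE, MPI_ANY_SOURCE, 0, comm,
                     MPI_STATUS_IGNORE);
            for (int i = 0; i < n * n; i++)
                partial[i] += tmp[i];                /* apply update */
        }
        free(tmp);
    }
}

/* Active-message style: every non-owner accumulates its partial update
 * directly into the owner's block; `win` must expose the owner's n*n
 * array (created earlier with MPI_Win_create). */
void update_last_block_low_latency(const double *partial, int n,
                                   int owner, int rank, MPI_Win win)
{
    MPI_Win_fence(0, win);
    if (rank != owner)
        MPI_Accumulate(partial, n * n, MPI_DOUBLE, owner,
                       0, n * n, MPI_DOUBLE, MPI_SUM, win);
    MPI_Win_fence(0, win);
}

In the buffered version each update pays for buffer copies and a matching receive at the owner; in the one-sided version the data is applied as it arrives, which is the behavior the low-latency speedups above exploit.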
The speedups reported in the two graphs in figure 7.9 are relative to the entire time required to factor or reduce the matrices in parallel on the respective number of processors. While updating the last block accounts for the largest relative workload on a single processor, this relative workload changes significantly for larger numbers of processors: because the last-diagonal-block section of the algorithm is less efficient in parallel, it assumes an even larger relative portion of the runtime. To provide a better understanding of the algorithm speedup at the component level, we present speedup graphs for active messages versus buffered communications in figure 7.10. Speedups for the factorization component are as great as 2.8, while speedups for the reduction component are as great as 14.8. Low-latency communications have their greatest impact when there are too few arithmetic operations to offset the greater overhead of buffered communications; this is most evident when comparing the factorization and forward-reduction speedups in figure 7.10.
Figure 7.10: Speedup --- Low-Latency versus Buffered Communications --- Update Last Block
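The gap between the overall speedups of figure 7.9 and the component speedups of figure 7.10 can be made concrete with an Amdahl-style relation; the notation below is ours and does not appear in the text. If the last-block update accounts for a fraction f of the buffered runtime and only that component is accelerated by the factor S_comp, the overall speedup is

\[ S_{\mathrm{overall}} = \frac{1}{(1 - f) + f / S_{\mathrm{comp}}} \]

Under the illustrative assumption f = 0.5, a component speedup of S_comp = 2.8 gives S_overall of roughly 1.5, which shows how overall factorization speedups near 1.6 can coexist with much larger component speedups.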
Obtaining usable speedups for the triangular solutions has proven difficult even with the low-latency communications paradigm, so the additional performance reductions of 1.2 to 2.0 incurred by buffered communications would significantly impair the usability of this algorithm. For complex-variate implementations of these LU algorithms, the effect of low-latency communications was somewhat smaller than for the double-precision LU algorithms. Conversely, the Choleski implementation saw more pronounced speedups from active messages due to its reduced workload: all algorithms send the same number of buffered communications messages, but less data is sent for the Choleski algorithms.
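To see why the same message count with less data favors the active-message paradigm, consider a simple latency-plus-bandwidth cost model. The sketch below is our back-of-the-envelope illustration; the latency, bandwidth, and message-size figures are assumptions chosen only to make the ratio visible, not measurements from the text.

#include <stdio.h>

int main(void)
{
    /* Assumed figures, for illustration only */
    double latency_us = 80.0;          /* fixed per-message overhead   */
    double bw_bytes_per_us = 10.0;     /* sustained bandwidth          */
    int nvals_lu = 1024;               /* values per LU border message */
    int nvals_chol = nvals_lu / 2;     /* symmetry: roughly half       */

    /* cost per message = latency + bytes / bandwidth */
    double t_lu   = latency_us + nvals_lu   * 8.0 / bw_bytes_per_us;
    double t_chol = latency_us + nvals_chol * 8.0 / bw_bytes_per_us;

    /* the fixed latency is a larger share of the smaller messages, so
     * eliminating it with active messages helps Choleski more */
    printf("LU:       %6.1f us/msg (latency %4.1f%%)\n",
           t_lu, 100.0 * latency_us / t_lu);
    printf("Choleski: %6.1f us/msg (latency %4.1f%%)\n",
           t_chol, 100.0 * latency_us / t_chol);
    return 0;
}

Under these assumed numbers the fixed overhead is roughly twice as large a share of each Choleski message as of each LU message, which is consistent with the more pronounced active-message speedups observed for the Choleski implementation.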