
Comparing Communications Paradigms

We have developed two versions of this parallel block-diagonal-bordered sparse linear solver: one uses a low-latency, active-message-based communications paradigm, and the other uses a buffered communications paradigm. The choice of communications paradigm significantly shapes the structure of the respective algorithms, as described in an earlier chapter. For all power systems networks examined, the largest relative workload when factoring or forward reducing the matrix is the update of the last diagonal block, so increased communications overhead in this portion of the algorithm could significantly affect parallel performance. In figure 7.9, we present graphs of the speedups obtained using low-latency communications versus buffered communications on the CM-5 for factoring and reducing the matrices. These graphs show that overall speedups can be as great as 1.6 when factoring the operations matrices with the low-latency algorithm, although speedups are smaller for the larger planning networks. Speedups are likewise greatest for the smaller operations matrices when updating the last diagonal block during forward reduction.

 
Figure 7.9: Speedup --- Low-Latency versus Buffered Communications  
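
To make the contrast between the two communications paradigms concrete, the following sketch mimics how updates destined for the last diagonal block might travel under each one. It is a minimal illustration written in C with the network faked by direct function calls; am_update() and buffered_flush() are hypothetical stand-ins for the low-latency active-message and buffered message-passing primitives, not the routines used in the actual implementation.

/*
 * Minimal sketch (not the author's CM-5 code) contrasting the two
 * communications paradigms for updating the last diagonal block.
 * The "network" is faked with direct function calls; am_update() and
 * buffered_flush() are hypothetical stand-ins for the low-latency
 * active-message and buffered message-passing primitives.
 */
#include <stdio.h>
#include <string.h>

#define N 4                       /* order of the last diagonal block */
static double last_block[N][N];   /* block owned by the "remote" processor */

typedef struct { int row, col; double value; } update_t;

/* Active-message style: a handler applies each small update as soon as
   it arrives, with no sender-side packing or receiver-side unpacking. */
static void am_update(int row, int col, double value)
{
    last_block[row][col] += value;
}

/* Buffered style: updates for one destination are packed into a buffer,
   shipped as one larger message, then copied and applied at the receiver. */
static void buffered_flush(const update_t *buf, int n)
{
    update_t staging[64];                   /* receive-side copy (n <= 64 here) */
    memcpy(staging, buf, n * sizeof *buf);  /* models the extra buffering copy  */
    for (int i = 0; i < n; i++)
        last_block[staging[i].row][staging[i].col] += staging[i].value;
}

int main(void)
{
    update_t u[] = { {0, 0, -1.5}, {1, 2, 0.25}, {3, 3, 2.0} };
    int n = sizeof u / sizeof u[0];

    for (int i = 0; i < n; i++)             /* low-latency paradigm */
        am_update(u[i].row, u[i].col, u[i].value);

    buffered_flush(u, n);                   /* buffered paradigm */

    printf("last_block[0][0] = %g\n", last_block[0][0]);  /* -3 */
    return 0;
}

Under the active-message paradigm each small update is applied as soon as it arrives, while the buffered paradigm packs updates into larger messages at the cost of extra copying and latency; that trade-off is what the measurements above quantify.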

The speedups reported in the two graphs in figure 7.9 are relative to the entire time required to factor or reduce the matrices in parallel on the respective number of processors. While updating the last block represents the largest relative workload on a single processor, this relative workload changes significantly for larger numbers of processors: the last-diagonal-block section assumes a larger relative portion of the workload because it is the least efficient part of the parallel algorithm. To provide a better understanding of the speedup at the component level, we present speedup graphs for active messages versus buffered communications in figure 7.10. Speedups for the factorization component are as great as 2.8, while speedups for the forward reduction component are as much as 14.8. Low-latency communications have their greatest impact when there are fewer calculations to offset the greater overhead of buffered communications; this is most evident when comparing the speedups for factorization versus forward reduction in figure 7.10.

 
Figure 7.10: Speedup --- Low-Latency versus Buffered Communications --- Update Last Block  
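
The gap between the component-level speedups in figure 7.10 and the overall speedups in figure 7.9 follows from the fraction of total run time that the last-block update actually occupies. The short calculation below works this out for assumed, purely illustrative numbers; it is not derived from the measured data.

/* Hypothetical illustration: how a large speedup in one algorithm
   component translates into a smaller overall speedup.  The numbers
   below are assumed for illustration only, not measured values. */
#include <stdio.h>

int main(void)
{
    double t_total   = 1.0;   /* normalized parallel run time with buffered messages */
    double t_lastblk = 0.4;   /* assumed fraction spent updating the last block      */
    double s_lastblk = 2.8;   /* component speedup from low-latency messages         */

    /* The remaining work is unaffected by the communications paradigm. */
    double t_new = (t_total - t_lastblk) + t_lastblk / s_lastblk;
    printf("overall speedup = %.2f\n", t_total / t_new);  /* about 1.35 here */
    return 0;
}

With these assumed numbers, a 2.8-fold speedup in the last-block update yields only about a 1.35-fold overall speedup, which is why the two sets of graphs differ so markedly.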

It has proven difficult to obtain usable speedups for the triangular solutions even with the low-latency communications paradigm, so performance reductions of 1.2 to 2.0 under buffered communications would have a significant impact on the usability of this algorithm. For complex-variate implementations of these LU algorithms, the effect of low-latency communications was somewhat less than for the double-precision LU algorithms, because there are more calculations per communication to offset the buffering overhead. Conversely, the Choleski implementation saw more pronounced speedups from active messages due to its reduced workload. The number of buffered communications messages is the same in all of the algorithms, although less data is sent in the Choleski algorithms.
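
As a rough illustration of that last point, the sketch below compares per-message payload sizes under two assumptions: a complex-variate entry occupies twice the bytes of a double-precision entry, and the Choleski update carries roughly half as many values as the LU update because only the lower triangular factor is stored. The entry count per message is arbitrary.

/*
 * Back-of-the-envelope sketch of buffered message sizes for the three
 * solver variants.  The counts are assumptions for illustration only:
 * `values` entries per message, Choleski sending roughly half as many
 * values as LU, and complex-variate entries taking twice the bytes of
 * double precision.
 */
#include <stdio.h>

int main(void)
{
    int values = 1000;                                   /* assumed entries per message */

    size_t lu_double  = values * sizeof(double);         /* real LU        */
    size_t lu_complex = values * 2 * sizeof(double);     /* complex LU     */
    size_t choleski   = (values / 2) * sizeof(double);   /* real Choleski  */

    printf("LU (double):  %zu bytes per message\n", lu_double);
    printf("LU (complex): %zu bytes per message\n", lu_complex);
    printf("Choleski:     %zu bytes per message\n", choleski);
    return 0;
}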


