
Comparing Communications Paradigms

We have developed two versions of this parallel block-diagonal-bordered sparse linear solver: one using an active message communications paradigm and the other using an asynchronous, non-blocking, buffered communications paradigm. The choice of communications paradigm significantly affects the structure of the respective algorithms, as described in section 7.2. For all power systems networks examined, the largest relative workload when factoring or forward reducing the matrix is the update of the last diagonal block, so increases in communications overhead in this portion of the algorithm could significantly degrade parallel performance. In figure 57 we present a graph of the speedups obtained on the CM-5 using active messages versus asynchronous non-blocking buffered communications for factoring the matrices. Likewise, in figure 58 we present a graph of the speedups obtained using active messages versus buffered communications for forward reduction of the matrices. These graphs show that active messages yield speedups as great as 1.6 when factoring the operations matrices, with smaller speedups for the planning networks. Speedups are likewise greatest for the smaller operations matrices when updating the last diagonal block during forward reduction.
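
The practical difference between the two paradigms is where each update to the last diagonal block is applied. The following minimal, single-process C sketch is our own illustration, not the solver's code: all names (am_update_handler, buffered_send, update_t) and the tiny dense block are hypothetical. An active message carries a handler that applies its update directly on arrival; the buffered paradigm copies each update into an intermediate buffer that is applied only when the buffer is drained.

    /*
     * Hypothetical, single-process illustration only: the CM-5 active
     * message and buffered communications libraries are not reproduced
     * here.  Both "paradigms" are mimicked in one address space to show
     * where the extra copying of the buffered path occurs.
     */
    #include <stdio.h>

    #define N 4                       /* order of the last diagonal block */

    static double last_block[N][N];   /* the shared last diagonal block   */

    /* One update contribution: subtract v from entry (row, col). */
    typedef struct { int row, col; double v; } update_t;

    /* Active message path: the handler applies the update directly on
     * arrival, with no intermediate buffering.                           */
    static void am_update_handler(const update_t *u)
    {
        last_block[u->row][u->col] -= u->v;
    }

    /* Buffered path: each update is first copied into a send buffer and
     * only applied later, when the receiver drains its buffer.           */
    #define BUF_LEN 16
    static update_t send_buf[BUF_LEN];
    static int buf_count = 0;

    static void buffered_send(const update_t *u)
    {
        send_buf[buf_count++] = *u;   /* extra copy into the buffer */
    }

    static void buffered_drain(void)
    {
        int i;
        for (i = 0; i < buf_count; i++)
            last_block[send_buf[i].row][send_buf[i].col] -= send_buf[i].v;
        buf_count = 0;
    }

    int main(void)
    {
        update_t u = { 1, 2, 3.5 };

        am_update_handler(&u);        /* "active message": applied at once */

        buffered_send(&u);            /* "buffered": copy now, ...         */
        buffered_drain();             /* ... apply on a later drain        */

        printf("last_block[1][2] = %g\n", last_block[1][2]);  /* prints -7 */
        return 0;
    }

The extra copy and the deferred application in the buffered path are the principal sources of the overhead whose cost is measured in the figures below.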

 
Figure 57: Speedup of Active Messages versus Buffered Communications - Double Precision LU Factorization

 
Figure 58: Speedup of Active Messages versus Buffered Communications - Double Precision Forward Reduction

The speedups reported in figures 57 and 58 are relative to the total time required to factor or reduce the matrices in parallel on the respective number of processors. Although updating the last diagonal block accounts for the largest relative workload on a single processor, this balance shifts significantly as the number of processors grows: the last-diagonal-block section scales less efficiently than the rest of the algorithm, so it consumes an even larger fraction of the total parallel run time. To provide a better understanding of the speedup within this algorithm component, we present speedup graphs for active messages versus buffered communications in figures 59 and 60. Speedups for the factorization component are as great as 2.8, while speedups for the reduction component are as great as 14.8.
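
The gap between component and overall speedups follows from a simple Amdahl-style identity (the notation below is ours, not taken from the figures): if T_last is the time spent updating the last diagonal block, T_other the time spent in the remainder of the algorithm, and only the last-block component is accelerated by a factor S_last, then the overall speedup is

    \[
    S_{\mathrm{overall}} \;=\;
    \frac{T_{\mathrm{other}} + T_{\mathrm{last}}}
         {T_{\mathrm{other}} + T_{\mathrm{last}} / S_{\mathrm{last}}} .
    \]

Even a component speedup of 14.8 therefore translates into a modest overall gain unless T_last dominates the total run time, which is exactly what happens at larger processor counts.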

 
Figure 59: Speedup of Active Messages versus Buffered Communications - Double Precision LU Factorization --- Update Last Block

 
Figure 60: Speedup of Active Messages versus Buffered Communications - Double Precision Forward Reduction --- Update Last Block

Active messages yield their greatest improvement when there are fewer floating-point operations to offset the additional overhead of buffered communications. This is most evident when comparing the factorization and forward-reduction speedups in figures 59 and 60. Obtaining usable speedups for the triangular solutions has been difficult even with the active message communications paradigm, so the performance reductions of 1.2 to 2.0 incurred by buffered communications would significantly affect the usability of this algorithm. For the complex-variate implementations of these LU algorithms, the effect was somewhat less pronounced than for the double precision versions. Conversely, the Choleski implementation saw more pronounced speedups from active messages because of its reduced computational workload: all algorithms perform the same number of buffered communications, but the Choleski algorithms send less data per communication.
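
This trade-off can be summarized with a simple latency-bandwidth cost model (again, our own notation rather than measured quantities): with n messages of b bytes each, per-message software overheads \alpha_{\mathrm{am}} < \alpha_{\mathrm{buf}} for the two paradigms, a per-byte transfer cost \beta, and computation time f, the relative performance of the two versions is roughly

    \[
    \frac{T_{\mathrm{buffered}}}{T_{\mathrm{active}}} \;=\;
    \frac{f + n\,(\alpha_{\mathrm{buf}} + \beta b)}
         {f + n\,(\alpha_{\mathrm{am}}  + \beta b)} .
    \]

As f shrinks (forward reduction has far fewer operations than factorization) or as b shrinks while n stays fixed (Choleski versus LU), the fixed per-message overheads dominate and the ratio moves toward \alpha_{\mathrm{buf}} / \alpha_{\mathrm{am}}, which is consistent with active messages showing their most pronounced advantage in exactly those cases.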






David P. Koester
Sun Oct 22 15:31:10 EDT 1995