
Comparing Communications Paradigms

We have developed two versions of this parallel block-diagonal-bordered sparse linear solver: one version uses a low-latency, active message-based communications paradigm and the other uses a buffered communications paradigm. These communications paradigms significantly modified the respective algorithms, as described in an earlier section.

The graphs in figure 7.17 illustrate direct comparisons of relative speedup for the low-latency, active message-based communications implementation and the buffered communications implementations for two power systems network matrices: BCSPWR09 and BCSPWR10. Performance for the other data sets was similar. These figures clearly illustrate the superior performance of the low-latency communications paradigm for the parallel block-diagonal-bordered sparse Gauss-Seidel solver. The low-latency implementations are always faster, even for two processors, and clearly faster for 32 processors. For the algorithm based on a more traditional send and receive paradigm, performance quickly becomes unacceptable as the number of processors increases. With the buffered communications-based implementation, no speedup was measured for 16 and 32 processors. Meanwhile, speedups as great as fourteen were measured for the double precision low-latency communications-based implementation. The remainder of this section discusses the reasons for the drastic differences in algorithm performance as a function of interprocessor communications paradigms.

 
Figure 7.17: Relative Speedup --- Double Precision Parallel Gauss-Seidel 

For the low-latency communications-based parallel Gauss-Seidel algorithm, the amount of communications is greatly reduced by sending newly calculated values only to those processors that actually need them when solving for an iteration in the last diagonal block. Figure 7.18 illustrates the number of low-latency messages required to distribute the calculated values, while figure 7.19 presents the percentage of low-latency messages required to distribute these values. The families of curves in figure 7.18 show that the number of low-latency messages increases at apparently linear rates with the number of processors for three of the five power systems networks. Meanwhile, figure 7.19 illustrates the percentages of data actually sent with the low-latency communications paradigm relative to the maximum possible for a broadcast. For 32 processors, only 10% to 18% of the broadcast values are actually required.
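
To make the mechanism concrete, the following C sketch illustrates the idea of point-to-point distribution: each newly calculated value is sent only to the processors whose equations reference it. This is not the author's CM-5 code; the need[][] table and active_message_put() are hypothetical names standing in for the precomputed communications lists and the active-message send primitive, and the problem sizes are illustrative.

    #include <stdio.h>

    #define NPROCS     4      /* processors (illustrative)                            */
    #define NUNKNOWNS  8      /* unknowns in the last diagonal block (illustrative)   */

    /* need[i][p] is nonzero if processor p references unknown i off-processor;
     * in practice this table is derived once from the sparse matrix structure. */
    static int need[NUNKNOWNS][NPROCS];

    /* Hypothetical stand-in for the active-message send primitive. */
    static void active_message_put(int dest_proc, int index, double value)
    {
        printf("send x[%d] = %g to processor %d\n", index, value, dest_proc);
    }

    /* Send each newly calculated value only to the processors that need it. */
    static void distribute_values(const double *x, int my_proc)
    {
        for (int i = 0; i < NUNKNOWNS; i++)
            for (int p = 0; p < NPROCS; p++)
                if (p != my_proc && need[i][p])
                    active_message_put(p, i, x[i]);   /* point-to-point, no broadcast */
    }

    int main(void)
    {
        double x[NUNKNOWNS] = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };
        need[0][1] = 1;            /* unknown 0 is referenced by processor 1 */
        need[3][2] = 1;            /* unknown 3 is referenced by processor 2 */
        distribute_values(x, 0);   /* this processor is number 0             */
        return 0;
    }

With a buffered broadcast, every value would be sent to every other processor; here only the entries flagged in the communications table generate messages, which is the source of the 10% to 18% figure above.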

In figures 7.20 and 7.21, we explore this phenomenon further for the BCSPWR09 and BCSPWR10 power systems networks. These figures present families of histograms of the number of low-latency messages required to distribute the calculated values. Each histogram shows the distribution of the number of required low-latency messages, and is labeled to emphasize the maximum number of messages. For the BCSPWR09 network, the maximum number of processors requiring any single value is only eleven. Likewise, for the BCSPWR10 network, the maximum number of processors requiring any single value is only eight. Significantly reducing the amount of communications in this component of the algorithm yields a corresponding improvement in overall parallel Gauss-Seidel algorithm performance. As a result, we are able to attain speedup even for an algorithm component that could have been sequential and would otherwise have limited overall algorithm performance as a function of Amdahl's law.
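
The histograms in figures 7.20 and 7.21 can be thought of as tallies over the same communications lists: for each calculated value, count how many processors require it. A minimal C sketch of that tally follows; the need[][] table is the same hypothetical structure used in the previous sketch, and the dimensions are illustrative.

    #include <stdio.h>

    #define NPROCS     4      /* processors (illustrative)                          */
    #define NUNKNOWNS  8      /* unknowns in the last diagonal block (illustrative) */

    /* Count, for each unknown, how many processors require its value, and
     * accumulate the counts into a histogram like those of figures 7.20/7.21. */
    static void message_histogram(int need[NUNKNOWNS][NPROCS], int histogram[NPROCS + 1])
    {
        for (int k = 0; k <= NPROCS; k++)
            histogram[k] = 0;           /* histogram[k] = unknowns needed by k processors */

        for (int i = 0; i < NUNKNOWNS; i++) {
            int count = 0;
            for (int p = 0; p < NPROCS; p++)
                if (need[i][p])
                    count++;
            histogram[count]++;
        }
    }

    int main(void)
    {
        int need[NUNKNOWNS][NPROCS] = { {0,1,0,0}, {0,1,1,0} };  /* sparse example */
        int histogram[NPROCS + 1];

        message_histogram(need, histogram);
        for (int k = 0; k <= NPROCS; k++)
            printf("%d unknown(s) require %d message(s)\n", histogram[k], k);
        return 0;
    }

The observation that no value is required by more than eleven (BCSPWR09) or eight (BCSPWR10) processors corresponds to the rightmost nonzero bin of such a histogram.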

 
Figure 7.18: Number of Low-Latency Messages Required to Distribute Calculated Values

 
Figure 7.19: Percentage of Broadcast Values Required to Distribute Calculated Values

 
Figure 7.20: BCSPWR09 --- Histograms of the Number of Messages to Distribute Calculated Values

 
Figure 7.21: BCSPWR10 --- Histograms of the Number of Messages to Distribute Calculated Values

The effect of reduced communications overhead can be seen clearly when the performance of the low-latency algorithm is compared to the performance of an algorithm with more traditional buffered communications. In the portion of the algorithm that solves for values in the last diagonal block, the buffered communications implementation must broadcast the newly calculated values to all other processors before the next color can proceed. With traditional interprocessor communications, each processor must send a separate buffered message to every other processor. Consequently, as the number of processors increases, the number of messages increases while the size of each buffered message decreases. For traditional message-passing paradigms, the cost of communications increases dramatically as the number of processors increases, because each message incurs the same latency regardless of the amount of data sent. On the Thinking Machines CM-5, a non-blocking, buffered send incurs 86 microseconds of latency plus a per-word (four-byte) transfer cost. Meanwhile, with the low-latency paradigm, an active message transfers four words of data with far lower overhead.
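
The following C sketch states this cost argument as a simple model. The 86 microsecond buffered-send latency is quoted from the text; the per-word buffered transfer time and the active-message time are placeholders, since the original values are not reproduced here, and the message counts follow the broadcast-versus-point-to-point reasoning above.

    #include <stdio.h>

    #define T_LATENCY 86.0e-6   /* buffered-send latency on the CM-5 (seconds), from text */
    #define T_WORD     1.0e-7   /* placeholder: per-word buffered transfer time           */
    #define T_AM       5.0e-6   /* placeholder: one four-word active message              */

    int main(void)
    {
        int    nprocs   = 32;     /* processors                                      */
        int    nvalues  = 1024;   /* values calculated per color (illustrative)      */
        double fraction = 0.15;   /* fraction of broadcast values actually required, */
                                  /* roughly 10%-18% per figure 7.19                 */

        /* Buffered paradigm: each processor sends its share of the values to every
         * other processor, one buffered message per destination.                    */
        double words_per_msg = (double)nvalues / nprocs;
        double t_buffered    = (nprocs - 1) * (T_LATENCY + words_per_msg * T_WORD);

        /* Active-message paradigm: one small message per value, sent only to the
         * processors that actually require that value.                              */
        double t_active = fraction * (nprocs - 1) * words_per_msg * T_AM;

        printf("buffered: %.6f s   active messages: %.6f s\n", t_buffered, t_active);
        return 0;
    }

Under these placeholder parameters the per-message latency dominates the buffered cost, which is the behavior observed empirically as the number of processors grows.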

In figure 7.22, we present speedup comparisons of the low-latency communications-based algorithm and the buffered communications-based algorithm. This series of graphs includes a comparison of the overall speedup of the low-latency communications paradigm versus the buffered communications paradigm, together with separate graphs of speedup for each of the three portions of the algorithm that require interprocessor communications. Overall, the matrices with fewer operations show speedups of greater than 28 for low-latency communications relative to buffered communications. The speedups for the larger matrices are smaller, because buffered communications are more efficient there --- more data can be sent in each message, improving performance for that message type. Nevertheless, for 32 processors, speedups of between nine and thirteen have been measured for four iterations and a convergence check on these power systems network matrices.

 
Figure 7.22: Low-Latency Communications Speedup --- Parallel Gauss-Seidel 

The other graphs in figure 7.22 illustrate the speedups for the individual algorithm components that require interprocessor communications.


