We have developed two versions of this parallel
block-diagonal-bordered sparse linear solver: one version uses a
low-latency, active message-based communications paradigm and the
other uses a buffered communications-based paradigm. The choice of
communications paradigm significantly shaped the respective
algorithms, as described in section .
The graphs in figure 7.17 illustrate direct comparisons of relative speedup for the low-latency, active message-based communications implementation and the buffered communications-based implementation for two power systems network matrices: BCSPWR09 and BCSPWR10. Performance for the other data sets was similar. These figures clearly illustrate the superior performance of the low-latency communications paradigm for the parallel block-diagonal-bordered sparse Gauss-Seidel solver. The low-latency implementations are faster even for two processors, and markedly faster for 32 processors. For the algorithm based on a more traditional send-and-receive paradigm, performance quickly becomes unacceptable as the number of processors increases. With the buffered communications-based implementation, no speedup was measured for 16 and 32 processors, while speedups as great as fourteen were measured for the double precision low-latency communications-based implementation. The remainder of this section discusses the reasons for these drastic differences in algorithm performance as a function of the interprocessor communications paradigm.
Figure 7.17: Relative Speedup --- Double Precision Parallel Gauss-Seidel
For the low-latency communications-based parallel Gauss-Seidel
algorithm, the amount of communications is greatly reduced by sending
values of $x$ only to those processors that actually need them when
solving for an iteration in the last diagonal block.
Figure 7.18 illustrates the number of low-latency messages required to
distribute the calculated values of $x$, while figure 7.19 presents
the percentage of low-latency messages required to distribute these
values. Families of curves in figure 7.18 show that, for three of the
five power systems networks, the number of low-latency messages
increases at an apparently linear rate when plotted against the number
of processors, which implies that the number of low-latency messages
grows at a rate proportional to the number of processors.
Meanwhile, figure 7.19 illustrates the percentages of
data actually sent with the low-latency communications paradigm
relative to the maximum possible for a broadcast. For 32 processors,
only 10% to 18% of the broadcast values are actually required.
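This selective distribution can be sketched as follows. The consumer lists and the `am_send_word` primitive below are illustrative assumptions, not the thesis's actual data structures or the CM-5 active message API; the sketch simply shows each newly computed value being delivered only to the processors that reference it, rather than broadcast to all $P-1$ others.

```c
/* Minimal sketch of selective value distribution (assumed names:
 * consumer_list, am_send_word; neither is from the original text). */
typedef struct {
    int  count;   /* number of processors that reference x[j] */
    int *procs;   /* their processor ids                      */
} consumer_list;

/* Deliver x[j] only to the processors whose equations reference it,
 * one short low-latency message per consumer, instead of a broadcast
 * to all P-1 other processors. */
void distribute_value(int j, double xj, const consumer_list *consumers,
                      void (*am_send_word)(int dest, int index, double val))
{
    const consumer_list *c = &consumers[j];
    for (int k = 0; k < c->count; k++)
        am_send_word(c->procs[k], j, xj);
}
```

Because the consumer lists are fixed by the matrix sparsity pattern, they can be computed once during the ordering phase and reused on every iteration.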
In figures 7.20 and 7.21, we explore this
phenomenon further for the BCSPWR09 and BCSPWR10 power systems
networks. These figures present families of histograms of the number
of low-latency messages required to distribute the values of $x$. Each
histogram shows the distribution of the number of required low-latency
messages and is labeled to emphasize the maximum number of messages.
For the BCSPWR09 network, the maximum number of processors requiring
any single value of $x$ is only eleven. Likewise, for the BCSPWR10
network, the maximum number of processors requiring any single value
of $x$ is only eight. Significantly reducing the
amount of communications in this component of the algorithm makes a
corresponding improvement in overall parallel Gauss-Seidel algorithm
performance. As a result, we are able to attain speedup even for an
algorithm component that could have been sequential, and that would
otherwise have limited overall algorithm performance as a function of
Amdahl's law.
Figure 7.18: Number of Low-Latency Messages Required to Distribute $x$
Figure 7.19: Percentage of Broadcast Values Required to Distribute $x$
Figure 7.20: BCSPWR09 --- Histograms of the Number of Messages to Distribute $x$
Figure 7.21: BCSPWR10 --- Histograms of the Number of Messages to Distribute $x$
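The message counts histogrammed in figures 7.20 and 7.21 follow directly from the sparsity pattern of the border equations. The sketch below is our illustration; the CSR arrays, the `owner` map, and all names are assumptions. It counts, for each value $x_j$, the number of distinct processors owning a border row that references column $j$.

```c
#include <stdlib.h>

/* Sketch: count, for each column j, the distinct processors whose
 * border rows reference x[j].  Inputs (all assumed for illustration):
 * a CSR border block (row_ptr/col_idx) with nrows rows, an owner[]
 * map from rows to processors, n columns, and P processors. */
int *count_consumers(int n, int P, int nrows,
                     const int *row_ptr, const int *col_idx,
                     const int *owner)
{
    char *needs = calloc((size_t)n * P, 1);          /* needs[j*P + p] */
    int  *n_consumers = calloc(n, sizeof *n_consumers);

    for (int i = 0; i < nrows; i++) {
        int p = owner[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            int j = col_idx[k];
            if (!needs[(size_t)j * P + p]) {         /* first use by p */
                needs[(size_t)j * P + p] = 1;
                n_consumers[j]++;
            }
        }
    }
    free(needs);
    return n_consumers;   /* these counts are what figures 7.20
                             and 7.21 histogram */
}
```

For BCSPWR09 the largest such count is eleven, and for BCSPWR10 it is eight, which is why so small a fraction of the full broadcast traffic is ever needed.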
The effect of reduced communications overhead can be clearly seen when
the performance of the low-latency algorithm is compared to the
performance of an algorithm with more traditional buffered
communications. In the portion of the algorithm that solves for values
of $x$, the buffered communications implementation must broadcast the
values of $x$ to all other processors before the next color can
proceed, so each of the $P$ processors sends buffered messages to the
other $P-1$ processors. Consequently, as the number of processors
increases, the number of messages increases while the size of each
buffered message decreases. For traditional message-passing paradigms,
the cost of communications increases dramatically as the number of
processors increases, because each message incurs the same latency
regardless of the amount of data sent. On the Thinking Machines CM-5,
non-blocking, buffered communications incur 86 microseconds of latency
plus an additional cost per word (four bytes) of buffered data.
Meanwhile, with the low-latency paradigm, an active message transfers
four words of data at a cost of only a few microseconds.
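These two cost structures can be captured in a simple linear model, sketched below. Only the 86-microsecond buffered latency comes from the text; the per-word buffered cost `t_word` and the per-message active message cost `t_am` are left as caller-supplied parameters because their exact values are not reproduced here.

```c
/* Simple linear communication cost model (illustrative only). */

/* Total cost of P processors each broadcasting `words` words to the
 * other P-1 processors with buffered sends; t_lat is 86e-6 s on the
 * CM-5 per the text, t_word is an assumed per-word cost. */
double buffered_cost(int P, int words, double t_lat, double t_word)
{
    return (double)P * (P - 1) * (t_lat + words * t_word);
}

/* Total cost of n_messages active messages, each carrying four words;
 * t_am is an assumed per-message cost, a few microseconds on the CM-5. */
double active_cost(long n_messages, double t_am)
{
    return (double)n_messages * t_am;
}
```

Under this model, the buffered cost grows roughly quadratically in the number of processors, while the active message traffic is bounded by the small consumer counts above, so the fixed latency term quickly dominates the buffered implementation as processors are added.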
In figure 7.22, we present speedup comparisons of the low-latency communications-based algorithm and the buffered communications-based algorithm. This series of graphs includes a comparison of overall speedup of the low-latency communications paradigm versus the buffered communications paradigm, together with separate graphs of speedup for each of the three portions of the algorithm that have interprocessor communications. Overall, the matrices with smaller operations counts show low-latency speedups of greater than 28 relative to buffered communications. The speedups for the larger matrices are smaller because buffered communications are more efficient there --- more data can be sent in each message, improving performance for that message type. Nevertheless, for 32 processors, speedups of between nine and thirteen have been measured for four iterations and a convergence check on these power systems network matrices.
Figure 7.22: Low-Latency Communications Speedup --- Parallel Gauss-Seidel
The other graphs in figure 7.22 illustrate the speedups for the algorithm components: