In the near future, we expect interprocessor communications for SPPs
to improve significantly, with latency for buffered communications
decreasing to levels that are available in MPPs like the Cray T3D
today. We anticipate that buffered communications latency for SPPs
will be only 1 microsecond, with 100 megabyte-per-second bandwidths
between individual processors [43]. Per-word communications costs for
this architecture should be less than 0.04 microsecond. In the figure
below, we present actual and predicted speedup values for the complex
LU factorization algorithm with the BCSPWR10 and EPRI6K power systems
networks.
Figure: Performance Predictions for Parallel Complex LU Factorization
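The per-word cost follows from the bandwidth under a simple latency-plus-per-word message-cost model. A minimal sketch, assuming the anticipated 1 microsecond latency, 100 megabyte-per-second bandwidth, and a 4-byte word (the word size is not stated explicitly in the text):

```python
# Latency-plus-bandwidth message-cost model for the anticipated SPP
# network: 1 us buffered latency, 100 MB/s bandwidth, 4-byte words
# (the word size is an assumption).

LATENCY_US = 1.0             # anticipated buffered-send latency (us)
BANDWIDTH_B_PER_US = 100.0   # 100 MB/s = 100 bytes per microsecond
WORD_BYTES = 4.0             # assumed word size

per_word_us = WORD_BYTES / BANDWIDTH_B_PER_US  # 0.04 us per word

def message_time_us(words):
    """Time to send one buffered message of the given number of words."""
    return LATENCY_US + words * per_word_us

print(per_word_us)            # 0.04 us/word, as stated above
print(message_time_us(500))   # a moderate, 500-word message: ~21 us
```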
To fully understand the predicted speedup values, we will analytically examine how changes in the computation-to-communications ratio theoretically affect speedup on a future architecture. To be thorough, we will examine both sections of the parallel block-diagonal-bordered direct solver that have interprocessor communications: updating the last diagonal block and factoring the last diagonal block.
We first examine updating the last diagonal block. For the
low-latency communications version of the algorithm, we cannot expect
any improvement in the computation-to-communications ratio when
updating the last diagonal block: we expect individual processor
performance to yield decreases in run times by a factor of 25, while
communications latency would decrease only to 1.16 microseconds from
1.6 microseconds. Thus the computation-to-communications ratio would
decrease, accentuating the effect of communications overhead.
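The shrinking ratio can be illustrated numerically. The baseline times below are hypothetical, chosen only to show the direction of the effect, and the 1.6 to 1.16 improvement is taken to be in microseconds:

```python
# Illustration of the shrinking computation-to-communications ratio:
# computation improves by a factor of 25 while latency-bound
# communications improve only from 1.6 us to 1.16 us per message.
# The baseline times below are hypothetical.

compute_old, comm_old = 1000.0, 100.0   # hypothetical per-step times (us)

compute_new = compute_old / 25.0        # factor-of-25 processor speedup
comm_new = comm_old * (1.16 / 1.6)      # latency improves only modestly

ratio_old = compute_old / comm_old
ratio_new = compute_new / comm_new

# The ratio falls from 10 to well under 1: communications overhead,
# formerly a tenth of the computation time, now dominates.
print(ratio_old, round(ratio_new, 2))
```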
Meanwhile, for buffered communications, most communications messages
are of moderate size, approximately 500 words, so we can expect that
communications performance in this section of the algorithm would
improve substantially, by as much as a factor of seven. Due to the
limited number of moderate-sized messages, performance in this portion
of our parallel direct algorithm using buffered communications is bandwidth
dependent. If communications latency decreases as significantly as we
anticipate and the communications bandwidth increases as expected, the
version of the algorithm to update the last diagonal block that would
yield the best performance would be the buffered communications
algorithm. Nevertheless, in spite of the dramatic communications
improvement, communications performance would not keep pace with the
factor-of-25 performance improvement of individual SPP processors.
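As a rough check of the factor-of-seven figure, one can compare message times under the same latency-plus-per-word cost model. The current CM-5 buffered-send parameters below (86 microsecond latency, 0.12 microseconds per word) are assumptions for illustration, not values from the text; the future-SPP parameters are those anticipated above:

```python
# Checking the "factor of seven" for a moderate, 500-word buffered
# message.  Current CM-5 buffered-send parameters (86 us latency,
# 0.12 us/word) are assumed for illustration; the future-SPP
# parameters (1 us, 0.04 us/word) are those anticipated above.

def msg_us(latency_us, per_word_us, words):
    """Linear message-cost model: startup latency plus per-word cost."""
    return latency_us + per_word_us * words

words = 500
current_us = msg_us(86.0, 0.12, words)  # assumed current buffered send
future_us = msg_us(1.0, 0.04, words)    # anticipated SPP buffered send

print(current_us / future_us)  # close to a factor of seven
```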
The communications in the section of the CM-5 program that factors the
last diagonal block use active-message scopy commands, which have 23
microsecond latency and 0.12 microsecond per-word communications costs
[6]. Messages are also of moderate size, several hundred
words, so we can expect that communications performance would improve
by a factor of nearly five. This is also significantly less than the factor of 25
improvement in single processor performance. Due to the limited
number of moderate sized messages, performance in this portion of our
parallel direct algorithm is also bandwidth dependent.
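Under the same latency-plus-per-word model, the improvement factor for these transfers depends strongly on message size, which is why this portion is bandwidth dependent. The sketch below uses the cited parameters (23 microseconds latency, 0.12 microseconds per word) against the anticipated future network (assumed 1 microsecond latency, 0.04 microseconds per word):

```python
# Improvement factor moving active-message copy transfers (23 us
# latency, 0.12 us/word, per [6]) onto the predicted SPP network
# (assumed 1 us latency, 0.04 us/word).

def msg_us(latency_us, per_word_us, words):
    """Linear message-cost model: startup latency plus per-word cost."""
    return latency_us + per_word_us * words

for words in (200, 300, 1000, 10000):
    ratio = msg_us(23.0, 0.12, words) / msg_us(1.0, 0.04, words)
    print(words, round(ratio, 2))

# For messages of a few hundred words the factor is near five; as
# messages grow, it falls toward the bandwidth ratio 0.12/0.04 = 3.
```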
If we combine the three portions of the speedup analysis (a factor of 25 improvement in processor speed, and factors of five to seven improvement in communications speed), it may not be possible to sustain the parallel speedup that we have obtained in this example program. Performance may be limited at 32 processors; however, strong performance with fewer processors may be sustainable, because communications overhead is not as great with fewer processors. Consequently, we should be able to obtain significant reductions in run time on a single processor due to increased processor performance, while the additional speedup due to parallelism will be less than we obtained in this research.
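To make the limitation concrete, here is a hypothetical projection; the 80/20 split between computation and communications time is an assumption for illustration, not a measurement from this research:

```python
# Hypothetical: suppose 80% of the current parallel run time is
# computation (improves by a factor of 25) and 20% is communications
# (improves by a factor of ~6, the middle of the five-to-seven range).
# The 80/20 split is assumed for illustration only.

compute_frac, comm_frac = 0.8, 0.2

new_time = compute_frac / 25.0 + comm_frac / 6.0  # normalized run time
share = (comm_frac / 6.0) / new_time              # comm share afterwards

# Communications grow from 20% of the run time to roughly half of it,
# so parallel speedup beyond a modest processor count is hard to sustain.
print(round(share, 2))
```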
If communications bandwidths between individual processors for our
future machine were to improve by an order of magnitude, to a
gigabyte per second, the prognosis for this algorithm would change
because of its bandwidth dependency. For gigabyte-per-second networks,
communications to update the last diagonal block could improve by a
factor of 48, and communications to factor the last diagonal block
could improve by a factor greater than 25. As a result, the
computation-to-communications ratio would be preserved, if not
improved, and similar or better parallel speedups could be expected.
This is clearly evident in the two graphs in the figure.
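As a sanity check on these projected factors, the same message-cost model can be evaluated at gigabyte-per-second bandwidth, where a 4-byte word costs 0.004 microseconds. The current-machine buffered-send parameters (86 microsecond latency, 0.12 microseconds per word) are assumptions for illustration; the scopy parameters come from [6]:

```python
# Gigabyte-per-second projection: a 4-byte word at 1 GB/s costs
# 0.004 us.  Buffered-send parameters for the current machine (86 us
# latency, 0.12 us/word) are assumed for illustration; the scopy
# parameters (23 us, 0.12 us/word) are from [6].

def msg_us(latency_us, per_word_us, words):
    """Linear message-cost model: startup latency plus per-word cost."""
    return latency_us + per_word_us * words

words = 500  # a moderate message

# Updating the last diagonal block (buffered sends):
update_ratio = msg_us(86.0, 0.12, words) / msg_us(1.0, 0.004, words)

# Factoring the last diagonal block (active-message copy transfers):
factor_ratio = msg_us(23.0, 0.12, words) / msg_us(1.0, 0.004, words)

print(round(update_ratio, 1), round(factor_ratio, 1))
# update_ratio is close to the projected factor of 48; factor_ratio
# exceeds 25, consistent with the projections above.
```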