
Performance Predictions

In the near future, we expect interprocessor communications for SPPs to improve significantly, with latency for buffered communications decreasing to levels available today in MPPs like the Cray T3D. We anticipate that buffered communications latency for SPPs will be only 1 μsecond, with 100 megabyte-per-second bandwidths between individual processors [43]. Per-word communications costs for this architecture should be less than 0.04 μsecond. In the figure below, we present actual and predicted speedup values for the complex LU factorization algorithm with the BCSPWR10 and EPRI6K power systems networks for

  1. empirical speedup data from the CM-5 implementation using buffered communications,
  2. predicted speedup for processor speeds and communications networks with 1 μsecond latency and 100 megabyte-per-second bandwidth,
  3. predicted speedup for processor speeds and communications networks with 1 μsecond latency and 1000 megabyte-per-second bandwidth.
The graphs in this figure show that we may see reduced speedup performance with this algorithm for processor speeds and communications networks with 1 μsecond latency and 100 megabyte-per-second bandwidth. However, we may see improved speedup performance for a network that is 10 times faster. The slower network has a lower computation-to-communications ratio than the CM-5, while the faster network has a greater computation-to-communications ratio than the CM-5. The performance predicted for future architectures in the two graphs of this figure is similar to the predicted performance observed for all data sets, and is a direct result of the scaling imposed by the constant of proportionality, defined as a function of processor and communications performance in the speedup model presented earlier.

 
Figure: Performance Predictions for Parallel Complex LU Factorization  
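The predictions above follow from a simple linear message-cost model: the time to deliver a message is a fixed startup latency plus a per-word transfer cost. A minimal sketch of that model, using the projected 1 μsecond latency and 0.04 μsecond per-word cost (the 500-word message size is an illustrative assumption):

```python
def message_time_us(latency_us, per_word_us, n_words):
    """Linear communication-cost model: startup latency plus per-word cost."""
    return latency_us + per_word_us * n_words

# Projected future SPP network: 1 us latency, 100 MB/s bandwidth,
# i.e. roughly 0.04 us per word.
future = lambda n_words: message_time_us(1.0, 0.04, n_words)

# A 500-word buffered message on the projected network:
print(future(500))  # 21.0 microseconds
```

Latency dominates small messages and bandwidth dominates large ones, which is why the two algorithm sections analyzed below respond differently to the projected network.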

To fully understand the predicted speedup values, we will examine analytically how changes in the computation-to-communications ratio affect speedup on a future architecture. To be thorough, we will examine both sections of the parallel block-diagonal-bordered direct solver that have interprocessor communications:

  1. updating the last diagonal block using the data in the borders,
  2. factoring the last diagonal block,
and when appropriate, we will examine both implementations:
  1. low-latency communications,
  2. buffered communications.

We first examine updating the last diagonal block. For the low-latency communications version of the algorithm, we cannot expect any improvement in the computation-to-communications ratio when updating the last diagonal block: we expect individual processor performance to yield decreases in run times by a factor of 25, while low-latency communications times would decrease only to 1.16 μseconds from 1.6 μseconds. Thus the computation-to-communications ratio would decrease, accentuating the effect of communications overhead. Meanwhile, for buffered communications, most messages are of moderate size, approximately 500 words, so we can expect that communications performance in this section of the algorithm would improve substantially, by as much as a factor of seven. Because there are only a limited number of moderate-sized messages, performance in this portion of our parallel direct algorithm using buffered communications is bandwidth dependent. If communications latency decreases as significantly as we anticipate and communications bandwidth increases as expected, the version of the algorithm to update the last diagonal block that would yield the best performance would be the buffered communications algorithm. Nevertheless, in spite of this dramatic communications improvement, communications performance would not keep pace with the factor-of-25 performance improvement of individual SPP processors.
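As a rough check on the factor-of-seven figure, we can compare buffered message times under a linear cost model. The CM-5 buffered-send parameters below (roughly 86 μsecond latency and 0.12 μsecond per word) are illustrative assumptions, not measurements quoted in this section:

```python
def message_time_us(latency_us, per_word_us, n_words):
    """Linear communication-cost model: startup latency plus per-word cost."""
    return latency_us + per_word_us * n_words

# Assumed CM-5 buffered-send parameters (illustrative): 86 us latency,
# 0.12 us per word, for a typical 500-word border-update message.
cm5 = message_time_us(86.0, 0.12, 500)      # 146 us

# Projected network: 1 us latency, 0.04 us per word.
future = message_time_us(1.0, 0.04, 500)    # 21 us

print(round(cm5 / future, 1))  # 7.0 -- about a factor of seven
```

At 500 words the per-word term dominates both machines, which is why this section of the algorithm is bandwidth dependent.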

The communications in the section of the CM-5 program that factors the last diagonal block use active-message scopy commands, which have 23 μsecond latency and 0.12 μsecond per-word communications costs [6]. Messages here are also of moderate size, several hundred words, so we can expect that communications performance would improve by a factor of nearly five. This is also significantly less than the factor-of-25 improvement in single-processor performance. Because there are only a limited number of moderate-sized messages, performance in this portion of our parallel direct algorithm is also bandwidth dependent.
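Under the same linear cost model, the scopy parameters quoted above can be compared directly against the projected network; the 250-word message size is an illustrative stand-in for "several hundred words":

```python
def message_time_us(latency_us, per_word_us, n_words):
    """Linear communication-cost model: startup latency plus per-word cost."""
    return latency_us + per_word_us * n_words

n = 250  # assumed "several hundred word" message (illustrative)

# CM-5 active-message scopy: 23 us latency, 0.12 us per word [6].
cm5_scopy = message_time_us(23.0, 0.12, n)  # 23 + 30 = 53 us

# Projected network: 1 us latency, 0.04 us per word.
future = message_time_us(1.0, 0.04, n)      # 1 + 10 = 11 us

print(round(cm5_scopy / future, 2))  # 4.82 -- "nearly five"
```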

If we combine the three portions of the speedup analysis, a factor of 25 improvement in processor speed against improvements of only five to seven in the two communications sections, it may not be possible to sustain the parallel speedup that we obtained in this example program. Performance may be limited for 32 processors; however, strong performance with fewer processors may be sustainable, because communications overhead is not as great with fewer processors. Consequently, we should be able to obtain significant speedups on a single processor due to increased processor performance, while the additional speedup due to parallelism will be less than we obtained in this research.
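The qualitative conclusion, that a 25-fold processor improvement paired with only a five-to-sevenfold communications improvement erodes parallel speedup, can be sketched with a two-term execution-time model. The communications fraction below is hypothetical, chosen only to illustrate the shape of the effect:

```python
def relative_speedup(f_comm, cpu_gain=25.0, comm_gain=6.0):
    """Ratio of new-machine parallel speedup to old-machine parallel speedup
    when computation improves by cpu_gain and communications by comm_gain.
    f_comm is the (hypothetical) fraction of parallel run time spent in
    interprocessor communications on the original machine."""
    t_parallel_new = (1.0 - f_comm) / cpu_gain + f_comm / comm_gain
    t_serial_new = 1.0 / cpu_gain  # a single processor improves by cpu_gain
    return t_serial_new / t_parallel_new

# With no communications, parallel speedup is fully preserved:
print(relative_speedup(0.0))            # 1.0
# The more time spent communicating, the more speedup erodes:
print(round(relative_speedup(0.3), 2))  # 0.51 -- about half the old speedup
```

The sketch shows why performance may be limited at 32 processors (where the communications fraction is largest) yet sustainable with fewer processors.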

If communications bandwidth between individual processors for our future machine improved by an order of magnitude, to a gigabyte per second, the prognosis for this algorithm would change because of its bandwidth dependency. For gigabyte-per-second networks, communications to update the last diagonal block could improve by a factor of 48, and communications to factor the last diagonal block could improve by a factor greater than 25. As a result, the computation-to-communications ratio would be preserved, if not improved, and similar or better parallel speedups could be expected. This is clearly evident in the two graphs of the figure above.
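The gigabyte-per-second projection follows from the same linear message-cost model with a per-word cost ten times smaller, roughly 0.004 μsecond; the CM-5 baseline parameters and message sizes below are illustrative assumptions, not figures quoted in this section:

```python
def message_time_us(latency_us, per_word_us, n_words):
    """Linear communication-cost model: startup latency plus per-word cost."""
    return latency_us + per_word_us * n_words

# Gigabyte-per-second network: 1 us latency, ~0.004 us per word.
gbs_border = message_time_us(1.0, 0.004, 500)  # 3 us, 500-word border update
gbs_factor = message_time_us(1.0, 0.004, 250)  # 2 us, 250-word scopy message

# Assumed CM-5 baselines (illustrative): buffered 86 us + 0.12 us/word;
# scopy 23 us + 0.12 us/word [6].
border_gain = message_time_us(86.0, 0.12, 500) / gbs_border
factor_gain = message_time_us(23.0, 0.12, 250) / gbs_factor

print(round(border_gain, 1))  # 48.7 -- roughly the factor of 48
print(round(factor_gain, 1))  # 26.5 -- greater than 25
```

At gigabyte-per-second bandwidths the communications improvement matches or exceeds the factor-of-25 processor improvement, which is why the computation-to-communications ratio, and hence speedup, is preserved.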






David P. Koester
Sun Oct 22 17:27:14 EDT 1995