next up previous contents
Next: Conclusions Up: Empirical Results --- Previous: Algorithm Performance on

Algorithm Performance on Future SPP Architectures

While we design and implement algorithms on existing hardware, it is desirable to predict algorithm performance for future architectures. We can expect future SPPs to be similar to the IBM SPx series with features approaching the Cray T3D massively parallel processor (MPP) [28]. When comparing the single processor performance of the CM-5 (a 33 MHz Sparc microprocessor from Sun Microsystems [3]) with the node of an SP1 or SP2 (a 62.5 MHz IBM RS/6000 model 370 four command superscalar microprocessor), the RS/6000 is 6.6 times faster today when comparing empirical data from our algorithm run on a single processor. In the near-future, it will feasible to get four times the individual processor power that we see today, so it is conceivable that the (near-term) next generation of SPP microprocessors will be 25 times as fast as today's microprocessors. Some of this power may come from placing multiple processors per SPP node,

If SPP node processor capability increases by a factor of 25, communications capabilities must improve by at least as much if the performance of parallel sparse direct linear solvers for power systems applications are to have equal or better performance on multiple processors. In other words, the communications-to-calculations ratio or granularity must remain constant. As we analyze communications performance of the parallel sparse direct solver, we must look at the two portions of the factorization algorithm that include communications: updating the last block and factoring the last block.

We implemented two versions of the parallel software on the CM-5 that updated the last diagonal block: with active message remote procedure calls (RPCs), and with non-blocking buffered communications. The active message based communications has latency of 1.6 second for an RPC, which transmits a data payload of four words. The non-blocking buffered communications version of the algorithm utilized the CMMD communications library, which has 86 second latency and 0.12 second per word communications costs [3]. The CM-5 has a multi-tiered communications network with 40 megabytes-per-second bandwidth at the lowest layer [3]. The IBM SP2 has a 30 second latency and 30 megabyte-per-second bandwidth in present configurations. In the near-future, we expect interprocessor communications for SPPs to improve significantly, with latency for buffered communications decreasing to levels that are available in MPPs like the Cray T3D today. We anticipate that buffered communications latency for SPPs, in the near future, will be only one second, with 100 megabytes-per-second bandwidths between individual processors [28]. Per-word communications costs for this architecture should be less than 0.04 second. For the active message communications version of updating the last diagonal block, we can't expect much improvement; however, for buffered communications, most communications messages are short, less than a kilobyte, so we can expect that communications performance in this section of the algorithm would improve by a factor of five. If communications latency decreases as significantly as we anticipate, the version of the algorithm to update the last diagonal block that would yield the best performance would be the buffered communications algorithm, but not keep pace with the performance improvement of individuals SPP processors.

The communications in the section of the CM-5 program that factors the last diagonal block uses active message s-copy commands, which have 23 second latency and 0.12 second per word communications costs [3]. Messages are short, again about a kilobyte, so we can expect that communications performance would improve by a factor of nearly four.

If we combine the three portions of the speedup analysis: improvements of a factor of 25 for the processor speed, and improvements of four or five in the communications speeds, it may not be possible to sustain the parallel speedup that we have obtained in this example program. Performance may be limited for 32 processors, however, strong performance with lessor numbers of processors should be sustainable, because communications overhead is not as great. Consequently, we should be able to obtain strong speedups with a single processor due to increased processor performance, while additional speedup due to parallelism may see less, although, multi-processor speedup should not decline to levels less than those speedups achieved for Choleski factorization.

If communications bandwidths between individual processors for our future machine improved an order of magnitude, to a gigabyte-per-second, the prognosis for this algorithm would change. For gigabyte-per-second networks, communications to update the last-diagonal block could improve by a factor of 40 and communications to factor the last diagonal block could improve by a factor greater than 25. As a result, the computation-to-communications ratio would be preserved, if not improved, and similar or better parallel speedups could be expected.

This research was inspired by the low latency communications possible using active messages on the CM-5. We believe that SPP architectures, like the IBM SP2, may eventually provide similar low-latency communications for short messages because there are many parallel algorithms that can only be implemented efficiently with this type of interprocessor communications support. SPP hardware developers recognize that low-latency communications increase the utility of their computer and, consequently, improve market potential. The research community also recognizes this fact and university research will keep pressure on hardware developers to provide lower latency communications and higher interconnection bandwidths.



next up previous contents
Next: Conclusions Up: Empirical Results --- Previous: Algorithm Performance on



David P. Koester
Sun Oct 22 15:31:10 EDT 1995