
Processing Bottlenecks

The implementation of the parallel block-diagonal-bordered sparse matrix solver included communications to gather timing information from the node processors. Timing information was sent from the nodes to the host processor for display. These timing values permitted a visual check of parallel algorithm performance at run time and also provided data for analyzing processing bottlenecks. The sample data were load balanced by construction, so disparities in these timings can be attributed directly to processing bottlenecks. The only place where processing bottlenecks are apparent in the data is the accumulation of partial sums in the factorization and forward-reduction steps.
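
As a rough illustration of this kind of instrumentation, the sketch below times one solver phase on every node and reports the minimum and maximum times at the host. MPI serves here as a stand-in for the CM-5 host/node communications of the original code; the phase name, the reduction-based aggregation, and all sizes are illustrative assumptions, not the paper's actual implementation.

#include <mpi.h>
#include <stdio.h>

static void dummy_phase(void) { /* stand-in for a solver step */ }

/* Time a phase on every node and report min/max at the host (rank 0);
 * a large gap between min and max hints at a processing bottleneck. */
static void timed_phase(const char *name, void (*phase)(void))
{
    double t0 = MPI_Wtime();
    phase();
    double t = MPI_Wtime() - t0;

    double tmin, tmax;
    MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("%-20s min %.6f s   max %.6f s\n", name, tmin, tmax);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    timed_phase("partial sums", dummy_phase);
    MPI_Finalize();
    return 0;
}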

Table 2 presents the minimum and maximum times reported by the nodes when generating the sparse vector of partial sums. The minimum values represent the time required to calculate the partial sums of update values, to generate the sparse vector of partial sums, and to perform the communications with the host node. The maximum values also include any time the slowest node processor spent waiting for the host processor to complete the communications of the sparse vector. Two distinct bottlenecks are illustrated in these data. The first is the time required to build the sparse vector of partial sums; the second is the sequential processing of the partial sums at the host node. All partial sums are calculated in parallel during this step before any communications are performed; then all node processors send their partial sums to the host processor, which must both receive the messages and sequentially finish accumulating the partial sums.
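
The following sketch illustrates this gather-and-accumulate pattern under stated assumptions: each node packs a sparse vector of (index, value) partial sums and sends it to the host, which receives one message per node and folds the contributions in sequentially. MPI again stands in for the CM-5 communications, and the message layout, tags, and sizes are all hypothetical.

#include <mpi.h>
#include <stdlib.h>

#define N      1024   /* order of the last (border) block, illustrative */
#define MAXNNZ  256   /* per-node cap on nonzero partial sums           */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank != 0) {
        /* Node: build a sparse vector of partial sums in parallel
         * (here just one dummy entry) and ship it to the host. */
        int    idx[MAXNNZ] = { rank % N };
        double val[MAXNNZ] = { 1.0 };
        int    nnz = 1;
        MPI_Send(&nnz, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(idx, nnz, MPI_INT, 0, 1, MPI_COMM_WORLD);
        MPI_Send(val, nnz, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    } else {
        /* Host: the sequential accumulation that forms the second
         * bottleneck -- contributions are applied one node at a time. */
        double *sum = calloc(N, sizeof *sum);
        int    idx[MAXNNZ];
        double val[MAXNNZ];
        for (int p = 1; p < nprocs; p++) {
            int nnz;
            MPI_Recv(&nnz, 1, MPI_INT, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(idx, nnz, MPI_INT, p, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(val, nnz, MPI_DOUBLE, p, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int k = 0; k < nnz; k++)
                sum[idx[k]] += val[k];   /* sequential fold at the host */
        }
        free(sum);
    }
    MPI_Finalize();
    return 0;
}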

Until later versions of the parallel block-diagonal-bordered sparse matrix solver factor the last block in parallel, nothing can be done about the second bottleneck. The first bottleneck, however, can be minimized by a communications model that makes efficient use of short messages. The initial startup time, represented by the values in the two columns of minimum processing times, can be reduced to nearly zero with active messages. With active messages, several words can be sent between any two processors on the CM-5 in less than 2 microseconds [22,28]. This is substantially less than the minimum times reported in Table 2, so an active-message construct in the algorithm should minimize the time the host processor sits idle waiting to begin processing.
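
CM-5 active messages have no exact modern equivalent, but MPI one-sided MPI_Accumulate captures the essential idea: each node deposits its partial sums directly into the host's memory, and the addition is carried out at the target much as an active-message handler would perform it, with no host-side receive loop. The sketch below is an analogue only, with illustrative sizes.

#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* order of the last block, illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose the host's accumulation vector as a one-sided window. */
    double *sum = NULL;
    MPI_Win win;
    if (rank == 0) sum = calloc(N, sizeof *sum);
    MPI_Win_create(sum, rank == 0 ? (MPI_Aint)(N * sizeof(double)) : 0,
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank != 0) {
        /* One short "message" per nonzero partial sum; the runtime
         * performs the addition at the target, as an active-message
         * handler would. */
        int    idx = rank % N;
        double val = 1.0;
        MPI_Accumulate(&val, 1, MPI_DOUBLE, 0, idx, 1, MPI_DOUBLE,
                       MPI_SUM, win);
    }
    MPI_Win_fence(0, win);   /* all contributions now folded into sum[] */

    MPI_Win_free(&win);
    if (rank == 0) free(sum);
    MPI_Finalize();
    return 0;
}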

 
[Table 2: Timings for Processing Bottlenecks]

Active messages will also be a significant factor in future implementations in which the data for the last block is distributed over multiple processors. Such implementations would require multiple messages to send the partial sums to the appropriate node processors. Once the last block is distributed, minimizing the idle time of a single processor waiting for data from the first block will no longer be the primary concern; instead, the focus will be on minimizing the total communications latency, because more, though shorter, messages will be required than in the present implementation. Active messages are specifically designed for low communications latency, so both processing bottlenecks observed above can be minimized in implementations that perform the last calculations in parallel and use active messages for communications.





