
Parallel Algorithm Performance

Parallel algorithm performance was measured using 32 processors, the default number of processors on NPAC's Thinking Machines CM-5. The number of independent blocks, the number of equations per matrix, the number of non-zero values, and the percentage of non-zero values for each test matrix are presented in table 1. This test data should exhibit linear timing performance for factorization as the number of equations is increased, because the performance of both sequential and parallel factorization can be described as a sum of terms in which the only variables are $N$, the number of independent blocks, and $N_p$, the number of independent blocks assigned to each processor.

$$T_{seq} \;=\; N \left( T_{factor} + T_{border} \right) + T_{update} + T_{last} \eqno(17)$$

$$T_{par} \;=\; N_p \left( T_{factor} + T_{border} \right) + T_{comm} + T_{update} + T_{last} \eqno(18)$$

where:

$T_{seq}$ is the total factorization time for the sequential algorithm.

$T_{par}$ is the total factorization time for the parallel algorithm.

$N$ is the number of independent blocks.

$N_p$ is the number of independent blocks per processor ($N_p = N/P$).

$T_{factor}$ is the time to factor an independent block.

$T_{border}$ is the time to calculate the partial sums of updates in the borders.

$T_{comm}$ is the time to communicate the partial sums to the host node.

$P$ is the number of processors.

$T_{update}$ is the time to update values using the partial sums.

$T_{last}$ is the time to factor the last block.

All values in these equations are essentially constants except for $N$ and $N_p$. Given these equations to predict processing time for both sequential and parallel factorization, run-time performance on this data is expected to be proportional to $N$, the number of blocks. There are a constant number of equations per block, so the time to factor these test matrices should be proportional to the total number of equations, or the order of the matrix. The computational complexity of factoring an independent block, $T_{factor}$, in equations 17 and 18 is a function of $n$, the number of equations in the independent sub-matrix. As long as $n$ remains constant, the time to factor the entire matrix should grow at a constant rate.
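To make the timing model concrete, the sketch below evaluates equations 17 and 18 over a range of $N$ for $P = 32$. This is an illustrative sketch only: the timing constants are invented placeholders, not measured CM-5 values, and the program is not the authors' implementation. It shows the predicted linear growth in both times and the way speedup approaches $P$ as the serial terms ($T_{comm}$, $T_{update}$, $T_{last}$) are amortized over more blocks.

#include <stdio.h>

/* Hypothetical per-phase timing constants (seconds); illustrative
 * placeholders only, not measured CM-5 values. */
#define T_FACTOR 0.010  /* time to factor one independent block         */
#define T_BORDER 0.004  /* time to form partial sums in the borders     */
#define T_COMM   0.002  /* time to send partial sums to the host node   */
#define T_UPDATE 0.003  /* time to apply partial sums to the last block */
#define T_LAST   0.005  /* time to factor the last block                */

int main(void)
{
    const int P = 32;                     /* number of processors      */
    for (int N = 32; N <= 512; N *= 2) {  /* independent blocks        */
        int    Np    = N / P;             /* blocks per processor      */
        double t_seq = N  * (T_FACTOR + T_BORDER) + T_UPDATE + T_LAST;
        double t_par = Np * (T_FACTOR + T_BORDER) + T_COMM
                       + T_UPDATE + T_LAST;
        printf("N=%4d  T_seq=%7.3f  T_par=%7.3f  speedup=%5.2f\n",
               N, t_seq, t_par, t_seq / t_par);
    }
    return 0;
}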

Figure 20 presents the wall-clock time required to factor the test matrices with the sequential algorithm running only on the CM-5 host processor and with the parallel algorithm running on the host processor and the 32 node processors. As predicted, the time required to factor these test matrices grows linearly for both the sequential and parallel algorithms. Classes of test matrices with linear growth in the number of calculations provide conservative estimates of speedup and efficiency for parallel algorithms, compared to classes of test matrices whose numbers of calculations increase at polynomial rates. These test matrices grow by adding independent blocks of constant size rather than by enlarging the independent blocks, so the number of calculations grows linearly as a function of the matrix order. Had the sub-blocks been enlarged instead, the number of calculations would have increased at a polynomial rate, and improved performance could then be a function of the increasing number of calculations available to be performed in parallel. Throughout this preliminary examination of block-diagonal-bordered sparse matrix factorization algorithms, the test matrices show good agreement between these theoretical predictions and the empirical data.

 
Table 1: Sample Matrix Sizes 

 
Figure 20: Linear Performance of Block-Bordered-Diagonal Sparse Matrix Factorization 

The empirical performance data for the parallel block-diagonal-bordered sparse matrix solver described in section 5 were collected on the 32-processor NPAC CM-5, with the factorization algorithms implemented in C using the CMMD version 2.0 library of message-passing functions. Speedup has been calculated for the five sample matrices and is presented graphically in figure 21. This graph plots speedup versus the number of equations in the matrix for each phase of the solver.

Likewise, efficiency is presented graphically in figure 22. For factorization, speedup ranges from 10.2 to 27.5 on the 32-processor NPAC CM-5, corresponding to efficiency values ranging from 32% to 86%.
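The relationship between the two figures is the standard one: speedup is the ratio of sequential to parallel wall-clock time, and efficiency is speedup divided by the processor count. A minimal sketch follows; the times in it are hypothetical placeholders chosen to land inside the reported range, not the actual measurements behind figures 21 and 22.

#include <stdio.h>

/* Speedup and efficiency from measured wall-clock times. */
static double speedup(double t_seq, double t_par) { return t_seq / t_par; }
static double efficiency(double s, int nprocs)    { return s / nprocs; }

int main(void)
{
    double t_seq = 8.60, t_par = 0.42;   /* hypothetical times (s) */
    double s = speedup(t_seq, t_par);
    printf("speedup = %.1f, efficiency = %.0f%%\n",
           s, 100.0 * efficiency(s, 32));
    return 0;
}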

Parallel factorization performance is rather good; however, forward reduction and backward substitution performance is not as promising in this implementation. Speedup for forward reduction ranges from 0.5 to 2.0, and, more significantly, the actual times for parallel factorization and parallel forward reduction are nearly equal. For a matrix with 32832 equations, the ratio of factorization time to forward reduction time is 3.5; for 2112 equations, the ratio is only 2.3. The performance difficulties with the parallel forward reduction algorithm stem from the fact that the forward reduction step performs far fewer calculations than the factorization step, so there are not enough calculations to overcome the communications overhead. Additional research will address faster communications techniques to attempt to minimize this problem.
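The difficulty can be summarized with a simple two-term model (an illustrative simplification, not part of the formal analysis above). If $T_{calc}$ denotes the arithmetic in the distributed portion of forward reduction and $T_{comm}$ its fixed communications overhead, then

$$S_{reduce} \;\approx\; \frac{T_{calc}}{\,T_{calc}/P + T_{comm}\,}$$

When $T_{calc}$ is small relative to $P\,T_{comm}$, the communications term dominates the denominator and speedup can fall below one, which is consistent with the observed range of 0.5 to 2.0.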

 
Figure 21: Speedup for the Block Bordered Diagonal Form Sparse Matrix Solver 

 
Figure 22: Efficiency for the Block Bordered Diagonal Form Sparse Matrix Solver 

Meanwhile, speedup for parallel backward substitution is less than for parallel factorization, although somewhat better than for parallel forward reduction. The broadcast communications step in parallel backward substitution is more efficient than the accumulation step in parallel forward reduction, primarily because fewer sequential operations are involved. Parallel backward substitution yielded speedups of 1.3 to 19.2, corresponding to efficiency values of 4% to 60%, and performance tends to increase rather rapidly as larger matrices are processed. Unlike all the other speedup curves in figure 21, the empirical data collected for backward substitution are not monotonically increasing. The probable explanation for this anomaly is that the measured performance of the parallel algorithm at this point was corrupted by processor contention: given the small actual run time, a small increase in execution time would drastically affect the measurement. There is no reason to believe that the performance of backward substitution would not follow a curve fitted to the other data points.
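The advantage of the broadcast can be illustrated with a standard communication-cost argument (an illustration under a uniform per-message cost $t_{msg}$, not a measurement from this study): accumulating $P$ partial sums at the host requires the host to receive and combine messages one after another, while a broadcast can fan out through a spanning tree of the processors:

$$T_{accumulate} \;\approx\; P\,t_{msg}, \qquad T_{broadcast} \;\approx\; \lceil \log_2 P \rceil \, t_{msg}$$

For $P = 32$, that is roughly 32 sequential message times for the accumulation versus 5 message steps for the broadcast.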

While the speedups for parallel forward reduction and parallel backward substitution are less than the speedup for factorization, the performance for a complete solution, factoring the matrix followed by a single forward reduction and a single backward substitution, is still reasonably good. Speedup for the solution of a single b vector ranges from 5.4 to 19.9, with corresponding efficiency values of 17% to 62%.
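The combined figure follows from applying the definition of speedup to the whole solution; with per-phase sequential and parallel times (the phase labels here are chosen for illustration):

$$S_{solve} \;=\; \frac{T^{seq}_{factor} + T^{seq}_{reduce} + T^{seq}_{subst}}{T^{par}_{factor} + T^{par}_{reduce} + T^{par}_{subst}}$$

Because factorization dominates the total time in both the sequential and parallel versions, $S_{solve}$ remains much closer to the factorization speedup than to the forward reduction speedup.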





