
Analyzing Algorithm Component Performance

We next present a detailed analysis of the performance of the component parts of the parallel block-diagonal-bordered direct linear solver. The graphs below show the time in milliseconds to perform each of the component operations of the algorithm:

  1. factorization
  2. forward reduction
  3. backward substitution
This detailed analysis of the parallel algorithm will demonstrate that the pre-processing phase can effectively load balance the matrix for as many as 32 processors, and it will illustrate some of the limitations of the algorithm for certain classes of data sets. We present data for two separate power systems networks: BCSPWR09, a small 1723-node network from an operations application, and BCSPWR10, a larger 5300-node network from a planning application. Empirical performance data for the operations network are presented in figures 51, 52, and 53; empirical performance data for the planning network are presented in figures 54, 55, and 56.
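To make the three component operations concrete, the sketch below outlines the factorization phase of a block-diagonal-bordered solver on small dense NumPy arrays. It is a minimal serial sketch, not the implementation measured here: the names (lu_nopivot, bbd_factor, diag_blocks, lower_border, upper_border, last_block) are ours, the measured solver operates on distributed sparse data structures, and pivoting is omitted on the assumption that these power systems matrices do not require it.

    import numpy as np

    def lu_nopivot(A):
        # Dense LU without pivoting: returns (L, U) with A = L @ U.
        n = A.shape[0]
        L = np.eye(n)
        U = A.astype(float).copy()
        for k in range(n - 1):
            L[k+1:, k] = U[k+1:, k] / U[k, k]
            U[k+1:, :] -= np.outer(L[k+1:, k], U[k, :])
        return L, U

    def bbd_factor(diag_blocks, lower_border, upper_border, last_block):
        # Phase 1: factor each independent diagonal block D_i and its border
        # strips C_i (the border rows below D_i) and B_i (the border columns
        # to the right of D_i).  In the parallel solver each triple is owned
        # by one processor, so this loop requires no communications.
        factors = []
        schur = last_block.astype(float).copy()
        for D, C, B in zip(diag_blocks, lower_border, upper_border):
            L, U = lu_nopivot(D)
            Cw = np.linalg.solve(U.T, C.T).T   # C_i U_i^{-1}
            Bw = np.linalg.solve(L, B)         # L_i^{-1} B_i
            factors.append((L, U, Cw, Bw))
            # Phase 2: update the last diagonal block with the Schur-
            # complement contribution from this block (the step that involves
            # interprocessor communications in the parallel implementation).
            schur -= Cw @ Bw
        # Phase 3: factor the updated last diagonal block.
        return factors, lu_nopivot(schur)

The forward reduction and backward substitution phases reuse these factors; a matching sketch appears after the discussion of the triangular-solution timings below.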

 
Figure 51: Algorithm Component Timing Data --- Double Precision LU Factorization --- BCSPWR09  

 
Figure 52: Algorithm Component Timing Data --- Double Precision Forward Reduction --- BCSPWR09  

 
Figure 53: Algorithm Component Timing Data --- Double Precision Backward Substitution --- BCSPWR09  

 
Figure 54: Algorithm Component Timing Data --- Double Precision LU Factorization --- BCSPWR10  

 
Figure 55: Algorithm Component Timing Data --- Double Precision Forward Reduction --- BCSPWR10  

 
Figure 56: Algorithm Component Timing Data --- Double Precision Backward Substitution --- BCSPWR10  

For factoring the operations network, figure 51 illustrates that factoring the diagonal blocks and updating the last diagonal block incur no apparent load-balancing overhead, and communications overhead when updating the last diagonal block is minimal. The curve representing the time to factor the diagonal blocks is nearly straight, with a slope that denotes nearly perfect parallelism: relative speedup at each point is approximately equal to the number of processors. The curve representing the time to update the last diagonal block is also nearly straight, although its slope shows that some overhead has occurred; on this log-log chart the difference in slope is slight. Meanwhile, the curve representing the time to factor the last diagonal block shows that this portion of the algorithm has poor performance: speedup is no more than 1.84 for sixteen processors, and performance declines for 32 processors. Fortunately, the preprocessing phase was able to partition the network and generate matrices in which the number of operations to factor the last diagonal block is significantly less than the number of operations to factor the diagonal blocks or to update the last diagonal block. The relative time required by the three factorization components is approximately 4:2:1. Relative speedup of the overall implementation is reduced by the limited speedup in the last diagonal block, although not to the full extent predicted by Amdahl's Law [12], because that block does achieve some speedup; nevertheless, there are performance limits to this algorithm. Treating the last diagonal block as the serial fraction, Amdahl's Law would limit speedup for this algorithm to approximately seven, and we have been able to approach that speedup here, although doubling the number of processors from 16 to 32 improves speedup only from 6.30 to 7.44.
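The speedup ceiling of approximately seven quoted above follows directly from the 4:2:1 workload ratio once the last diagonal block is treated as the serial fraction. The short calculation below, with a helper name (block_speedup) of our own choosing, reproduces that bound and shows why the measured 16-processor speedup of 6.30 is plausible once the last block's measured speedup of 1.84 is folded in; the model ignores load-imbalance and communications overheads, so it is only an upper estimate.

    def block_speedup(work, speedup):
        # Overall speedup when component i accounts for work[i] of the serial
        # time and achieves speedup[i] on its own.
        return sum(work) / sum(w / s for w, s in zip(work, speedup))

    # Factorization workload ratio (diagonal blocks : last-block update :
    # last diagonal block) is roughly 4:2:1.  Treating the last block as
    # purely serial and the other two phases as perfectly parallel gives the
    # Amdahl-style ceiling of about seven:
    print(block_speedup([4, 2, 1], [float('inf'), float('inf'), 1.0]))  # 7.0

    # Folding in the measured last-block speedup of 1.84 on 16 processors
    # raises the estimate to about 7.6, an upper bound on the measured 6.30:
    print(block_speedup([4, 2, 1], [16.0, 16.0, 1.84]))                 # ~7.6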

For the triangular solutions, figures 52 and 53 show that we obtained no speedup when performing the triangular solutions in the last diagonal block. The relative time required by the forward-reduction components is approximately 4:3:2, so speedup for forward reduction would be limited to approximately 4.5, a value the algorithm could not reach with 32 processors due to load-imbalance overhead when reducing the diagonal blocks and when updating the last diagonal block. The relative time required by the backward-substitution components is approximately 4:1, so speedup for backward substitution would be limited to approximately 5.0, a value the algorithm also could not attain due to additional overhead. Both triangular solution algorithms suffered load-imbalance overhead, which was slight and not unexpected: we distributed the data to processors as a function of balanced computational load for factorization. Dense factorization requires O(n^3) floating point operations, while dense triangular solutions require only O(n^2) floating point operations. The sparse matrices associated with these power systems networks have significantly lower computational complexity for both components; however, factorization still requires more calculations per row than the triangular solves, so some load-imbalance overhead is encountered in these algorithms.
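The triangular-solution phases follow the same block structure as the factorization. The sketch below is a continuation of the bbd_factor sketch given earlier (same illustrative names and assumptions, dense arrays, no pivoting). It shows why forward reduction mirrors the three-component pattern while backward substitution reduces to two components: the independent blocks cannot be completed until the last-block solution is known.

    import numpy as np

    def bbd_solve(factors, last_factors, b_blocks, b_last):
        L_last, U_last = last_factors
        # Forward reduction: solve L_i y_i = b_i within each diagonal block,
        # then accumulate the border contribution into the last-block
        # right-hand side before reducing the last diagonal block.
        y_blocks, y_last = [], b_last.astype(float).copy()
        for (L, U, Cw, Bw), b in zip(factors, b_blocks):
            y = np.linalg.solve(L, b)
            y_blocks.append(y)
            y_last -= Cw @ y
        y_last = np.linalg.solve(L_last, y_last)
        # Backward substitution: the last diagonal block is solved first, and
        # only then can each diagonal block be completed independently.
        x_last = np.linalg.solve(U_last, y_last)
        x_blocks = [np.linalg.solve(U, y - Bw @ x_last)
                    for (L, U, Cw, Bw), y in zip(factors, y_blocks)]
        return x_blocks, x_last

    # Consistency check against a dense solve on a tiny assembled system
    # (two 3x3 diagonal blocks, a 2x2 last block, random test data):
    rng = np.random.default_rng(0)
    D = [rng.random((3, 3)) + 4 * np.eye(3) for _ in range(2)]
    B = [rng.random((3, 2)) for _ in range(2)]
    C = [rng.random((2, 3)) for _ in range(2)]
    A_nn = rng.random((2, 2)) + 4 * np.eye(2)
    A = np.block([[D[0], np.zeros((3, 3)), B[0]],
                  [np.zeros((3, 3)), D[1], B[1]],
                  [C[0], C[1], A_nn]])
    b = rng.random(8)
    factors, last = bbd_factor(D, C, B, A_nn)   # from the earlier sketch
    x_blocks, x_last = bbd_solve(factors, last, [b[0:3], b[3:6]], b[6:8])
    assert np.allclose(np.concatenate(x_blocks + [x_last]), np.linalg.solve(A, b))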

We next examine parallel algorithm performance for a larger power systems network, BCSPWR10, which has four times the number of rows/columns and over eleven times the number of floating point operations. Figure 54 illustrates that factoring the diagonal blocks and updating the last diagonal block show little apparent load-balancing overhead, and communications overhead when updating the last diagonal block is minimal. Relative speedups on 32 processors are 29.9 for factoring the diagonal blocks and 21.6 for updating the last diagonal block. Performance when factoring the last diagonal block shows great improvement for this planning matrix compared to the small operations matrix, BCSPWR09. While there is no measurable speedup for two processors due to the pipelined nature of the algorithm, parallel performance improves respectably for larger numbers of processors: the timing data in figure 54 correspond to a speedup of 4.9 for factoring the last diagonal block on 32 processors. The relative workload on a single processor is 2:6:5 for factoring this matrix. The extensive number of operations required to update and to factor the last diagonal block makes it imperative that good speedups be obtained in these algorithm sections, and they have been, in spite of the fact that both sections contain communications. The relative speedup obtained for factoring this matrix is 9.4 for 32 processors.
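As a rough consistency check, the component figures quoted above can be combined with the block_speedup helper sketched earlier: the 2:6:5 workload ratio together with component speedups of 29.9, 21.6, and 4.9 predicts an overall speedup of about 9.5, close to the measured 9.4, with the small gap attributable to overheads the simple model ignores.

    # BCSPWR10 factorization on 32 processors: workload ratio 2:6:5 and
    # component speedups 29.9 (diagonal blocks), 21.6 (last-block update),
    # 4.9 (last diagonal block), using the block_speedup helper from above.
    print(block_speedup([2, 6, 5], [29.9, 21.6, 4.9]))   # ~9.5 (measured: 9.4)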

Performance of the triangular solvers on this larger planning matrix is more promising than for the operations matrix. Figures 55 and 56 show that we were able to obtain limited speedup when performing the triangular solutions in the last diagonal block. The relative time required by the forward-reduction components is approximately 1:1:1, so speedup for forward reduction would be limited to approximately 3 under Amdahl's Law, a value the algorithm was able to exceed, reaching 3.6 with 32 processors. The relative time required by the backward-substitution components is approximately 2:1, so speedup for backward substitution would be limited to approximately 3.0, a value the algorithm also surpassed, although only slightly. Both triangular solution algorithms suffered almost no load-imbalance overhead for this larger power systems network, in spite of the fact that we distributed the data to processors as a function of balanced computational load for factorization.
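The same helper also shows how the nominal factor-of-three ceiling for the 1:1:1 forward-reduction workload can be exceeded once the last diagonal block itself achieves any speedup at all. The last-block speedup of 1.3 used below is purely illustrative (the text reports only the overall figure of 3.6), so this is a plausibility sketch rather than a reconstruction of the measured components.

    # Forward reduction on BCSPWR10, workload ratio 1:1:1.  With the last
    # diagonal block treated as serial the ceiling is 3.0; with a purely
    # illustrative last-block speedup of 1.3 on 32 processors the estimate
    # rises to roughly the measured overall value of 3.6.
    print(block_speedup([1, 1, 1], [float('inf'), float('inf'), 1.0]))  # 3.0
    print(block_speedup([1, 1, 1], [32.0, 32.0, 1.3]))                  # ~3.6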

We have conducted similar detailed examinations of the performance of the algorithm for the three other power systems networks and have obtained similar results. We draw the following conclusions from this detailed examination of the algorithm components:


