We next present a detailed analysis of the performance of the components of the parallel block-diagonal-bordered direct linear solver. The following figures show the time in milliseconds required to perform each component operation of the algorithm:
Figure 51: Algorithm Component Timing Data --- Double Precision LU Factorization --- BCSPWR09
Figure 52: Algorithm Component Timing Data --- Double Precision Forward Reduction --- BCSPWR09
Figure 53: Algorithm Component Timing Data --- Double Precision Backward Substitution --- BCSPWR09
Figure 54: Algorithm Component Timing Data --- Double Precision LU Factorization --- BCSPWR10
Figure 55: Algorithm Component Timing Data --- Double Precision Forward Reduction --- BCSPWR10
Figure 56: Algorithm Component Timing Data --- Double Precision Backward Substitution --- BCSPWR10
For factoring the operations network, figure 51 illustrates that factoring the diagonal blocks and updating the last diagonal block incur no apparent load balancing overhead, and that communications overhead when updating the last diagonal block is minimal. The curve representing the time to factor the diagonal blocks is nearly straight, with a slope that denotes nearly perfect parallelism: relative speedup at each point is approximately equal to the number of processors. The curve representing the time to update the last diagonal block is also nearly straight, although its slope shows that some overhead has occurred; on this log-log chart the difference in slope is slight. Meanwhile, the curve representing the time to factor the last diagonal block shows that this portion of the algorithm has poor performance: speedup is no more than 1.84 for 16 processors, and performance declines for 32 processors. Fortunately, the preprocessing phase was able to partition the network and generate matrices in which the number of operations to factor the last diagonal block is significantly less than the number of operations to factor the diagonal blocks or to update the last diagonal block. The relative time required for the three factorization components is approximately 4:2:1. Relative speedup of the overall implementation is limited by the poor speedup of the last block, although not to the full extent that Amdahl's Law [12] would predict for a strictly serial component; nevertheless, there are performance limits to this algorithm. Amdahl's Law would limit speedup for this algorithm to approximately seven, and we have been able to achieve that speedup here, although doubling the processors from 16 to 32 improves speedup only from 6.30 to 7.44.
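As a worked sketch of this bound, assume the 4:2:1 ratio corresponds to factoring the diagonal blocks, updating the last diagonal block, and factoring the last diagonal block, respectively, and treat the last-block factorization as effectively serial. Amdahl's Law then gives
\[
S(p) \;=\; \frac{1}{f_s + \frac{1 - f_s}{p}},
\qquad
f_s = \frac{1}{4 + 2 + 1} = \frac{1}{7},
\qquad
\lim_{p \rightarrow \infty} S(p) = \frac{1}{f_s} = 7 .
\]
The measured 7.44 on 32 processors slightly exceeds this asymptotic bound only because the last-block factorization is not strictly serial; it attains a speedup of 1.84 on 16 processors.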
For the triangular solutions, figures 52 and 53 show that we were able to obtain no speedup when performing the triangular solutions on the last diagonal block. The relative time required for the forward reduction components is approximately 4:3:2, so speedup for forward reduction would be limited to approximately 4.5, a value that the algorithm could not reach with 32 processors due to load imbalance overhead in the reduction of the diagonal blocks and in the updates to the last diagonal block. The relative time required for the backward substitution components is approximately 4:1, so speedup for backward substitution would be limited to approximately 5.0, a value that the algorithm also could not attain due to additional overhead. Both triangular solution algorithms suffered load imbalance overhead, which was slight and not unexpected: we distributed the data to processors as a function of balanced computational load for factorization. Dense factorization requires $O(n^3)$ floating point operations, while dense triangular solutions require only $O(n^2)$ floating point operations. The sparse matrices associated with these power systems networks have significantly lower computational complexity for both components; however, factorization still performs more calculations per row than the triangular solves. As a result, some load imbalance overhead is encountered in these algorithms.
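For reference, the limits quoted above follow directly from the workload ratios, assuming the final entry in each ratio is the last-block triangular solution, which obtains no speedup, while the remaining components scale ideally:
\[
S_{\mathrm{fwd}} \;\le\; \frac{4 + 3 + 2}{2} = 4.5,
\qquad\qquad
S_{\mathrm{bwd}} \;\le\; \frac{4 + 1}{1} = 5.0 .
\]
Load imbalance and communications overhead in the remaining components push the measured speedups below these ceilings.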
We next examine parallel algorithm performance for a larger power systems network, BCSPWR10, which has four times the number of rows/columns and over eleven times the number of floating point operations. Figure 54 illustrates that factoring the diagonal blocks and updating the last diagonal block incur little apparent load balancing overhead, and that communications overhead when updating the last diagonal block is minimal. Relative speedups on 32 processors are 29.9 for factoring the diagonal blocks and 21.6 for updating the last diagonal block. Performance for factoring the last diagonal block shows great improvement for this planning matrix when compared to the smaller operations matrix, BCSPWR09. While there is no measurable speedup for two processors due to the pipelined nature of the algorithm, parallel performance improves respectably for larger numbers of processors: the timing data in figure 54 correspond to a speedup of 4.9 for factoring the last diagonal block on 32 processors. The relative workload on a single processor is 2:6:5 for factoring this matrix. Because the updates to and factorization of the last block account for most of the operations, it is essential that good speedups be obtained in these algorithm sections, and they have been, in spite of the fact that both sections contain communications. The relative speedup obtained for factoring this matrix is 9.4 for 32 processors.
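As a rough consistency check, assume the 2:6:5 workload ratio is ordered as factoring the diagonal blocks, updating the last diagonal block, and factoring the last diagonal block; combining it with the component speedups quoted above yields an overall factorization speedup on 32 processors of approximately
\[
S_{32} \;\approx\; \frac{2 + 6 + 5}{\frac{2}{29.9} + \frac{6}{21.6} + \frac{5}{4.9}}
\;\approx\; \frac{13}{1.37} \;\approx\; 9.5 ,
\]
which agrees closely with the measured relative speedup of 9.4; the small difference is absorbed by overheads not captured in this simple model.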
Performance of the triangular solvers on this larger planning matrix is more promising than for the operations matrix. Figures 55 and 56 show that we were able to obtain limited speedup when performing the triangular solutions on the last diagonal block. The relative time required for the forward reduction components is approximately 1:1:1, so speedup for forward reduction would be limited to approximately 3 under Amdahl's Law if the last block were processed serially; because the last block achieves some speedup, the algorithm was able to exceed this value, reaching 3.6 with 32 processors. The relative time required for the backward substitution components is approximately 2:1, so backward substitution speedup would likewise be limited to approximately 3.0, a value that the algorithm also surpassed, although only slightly. Both triangular solution algorithms suffered almost no load imbalance overhead for this larger power systems network, in spite of the fact that we distributed the data to processors as a function of balanced computational load for factorization.
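To see how forward reduction can exceed the nominal Amdahl limit of 3, suppose the last-block reduction attains a modest speedup $s$ (a hypothetical quantity introduced here only for illustration) while the other two components scale ideally on $p$ processors. Then
\[
S(p) \;=\; \frac{3}{\frac{1}{s} + \frac{2}{p}} ,
\]
and with $p = 32$ the observed speedup of 3.6 corresponds to $s \approx 1.3$ under this simplified model.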
We have conducted similar detailed examinations of algorithm performance for the three other power systems networks and have obtained similar results. We draw the following conclusions from this detailed examination of the algorithm components: