We next present a detailed analysis of the performance of the component parts of the parallel block-diagonal-bordered direct linear solver. The following graphs show the time, in milliseconds, required to perform each of the component operations of the algorithm:
Figure 7.7: BCSPWR09 --- Algorithm Component Timing Data
Figure 7.8: BCSPWR10 --- Algorithm Component Timing Data
For factoring the operations network, the respective graph in figure 7.7 illustrates that factoring the diagonal blocks and updating the last diagonal block have no apparent load-balancing overhead, and that communications overhead is minimal when updating the last diagonal block. The curve representing the time to factor the diagonal blocks is nearly straight, with a slope that denotes nearly perfect parallelism: relative speedup at each point is approximately equal to the number of processors. The curve representing the time to update the last diagonal block is also nearly straight, although its slope shows that some overhead has been incurred; on this log-log chart, the difference in slope is slight. Meanwhile, the curve representing the times to factor the last diagonal block shows that this portion of the algorithm has poor performance: speedups are no more than 1.84 for sixteen processors, and performance shows no improvement for 32 processors. Fortunately, the preprocessing phase was able to partition the network and generate matrices in which the number of operations to factor the last diagonal block is significantly smaller than the number of operations to factor the diagonal blocks or update the last diagonal block.
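To make the three timed factorization components concrete, the following serial, dense-block sketch outlines the block-diagonal-bordered factorization in the same order: independent factorization of the diagonal blocks, the Schur-complement update of the last diagonal block, and the factorization of the last diagonal block. The routine and argument names (factor_bbd, diag_blocks, borders_right, borders_bottom, last_block) are purely illustrative assumptions and are not taken from the actual sparse, message-passing implementation.

\begin{verbatim}
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def factor_bbd(diag_blocks, borders_right, borders_bottom, last_block):
    """Factor a block-diagonal-bordered matrix in three phases.

    Illustrative dense sketch (float NumPy arrays assumed):
      phase 1: factor each independent diagonal block A_i
      phase 2: update the last diagonal block,
               D <- D - sum_i C_i A_i^{-1} B_i
      phase 3: factor the updated last diagonal block
    """
    diag_factors = []
    D = np.array(last_block, dtype=float)        # working copy of the last block
    for A, B, C in zip(diag_blocks, borders_right, borders_bottom):
        lu = lu_factor(A)                        # phase 1: independent block factorization
        diag_factors.append(lu)
        D -= C @ lu_solve(lu, B)                 # phase 2: Schur-complement contribution
    last_factor = lu_factor(D)                   # phase 3: factor the last diagonal block
    return diag_factors, last_factor
\end{verbatim}

In the parallel solver, the first phase requires no communication, the second requires only accumulating the update terms onto the last block, and the third is the pipelined phase, containing communications, whose limited speedup is visible in the figures.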
For the triangular solutions, the respective graphs in figure 7.7 show that we obtained no speedup when performing the triangular solutions in the last diagonal block. Both triangular solution algorithms suffered slight load-imbalance overhead, which was not unexpected, because we distributed the data to processors as a function of balanced computational load for factorization. For the sparse matrices associated with these power systems networks, factorization and triangular solution have computational complexities of significantly lower order than in the dense case; nevertheless, factorization still requires more calculations per row than the triangular solves. As a result, some load-imbalance overhead is encountered in these algorithms.
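The triangular-solution components follow the same block structure. Continuing the hypothetical routine above (again a serial dense sketch, not the distributed implementation), the forward and backward phases might look as follows; the per-block solves are independent, while the solve within the last diagonal block is the portion for which figure 7.7 shows no speedup.

\begin{verbatim}
from scipy.linalg import lu_solve   # continuing the sketch above

def solve_bbd(diag_factors, borders_right, borders_bottom,
              last_factor, b_blocks, b_last):
    """Forward and backward solution phases for the factored system."""
    # independent solves on the diagonal blocks
    y = [lu_solve(lu, b) for lu, b in zip(diag_factors, b_blocks)]
    # accumulate the right-hand side for the last diagonal block
    r = b_last - sum(C @ yi for C, yi in zip(borders_bottom, y))
    # solve within the last diagonal block (little or no speedup here)
    x_last = lu_solve(last_factor, r)
    # correct the diagonal-block solutions for the border coupling
    x = [yi - lu_solve(lu, B @ x_last)
         for yi, lu, B in zip(y, diag_factors, borders_right)]
    return x, x_last
\end{verbatim}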
We next examine parallel algorithm performance for a larger power systems network, BCSPWR10, which has four times the number of rows/columns and over eleven times the number of floating point operations. The graph for LU factorization in figure 7.8 illustrates that factoring the diagonal blocks and updating the last diagonal block incur little apparent load-balancing overhead, and that communications overhead is minimal when updating the last diagonal block. Relative speedups on 32 processors are 29.9 for factoring the diagonal blocks and 21.6 for updating the last diagonal block. Performance for factoring the last diagonal block shows great improvement for this planning matrix when compared to the smaller operations matrix, BCSPWR09. While there is no measurable speedup for two processors, due to the pipelined nature of the algorithm, parallel performance improves respectably for larger numbers of processors; the timing data for LU factorization in figure 7.8 correspond to a speedup of 4.9 for factoring the last diagonal block on 32 processors. The large number of operations required to update and factor the last block makes it imperative that good speedups be obtained in these algorithm sections, and they were, in spite of the fact that both sections contain communications. The overall relative speedup obtained for factoring this matrix is 9.4 on 32 processors.
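For reference, the relative speedups quoted throughout this section are the ratio of the single-processor time to the p-processor time for the same algorithm component. A small helper of the following form (the name relative_speedup and the dictionary layout are our own, purely for illustration) performs that calculation on times read from one curve of the log-log plots:

\begin{verbatim}
def relative_speedup(times_ms):
    """Relative speedup S_p = T_1 / T_p for each processor count p.

    times_ms maps processor count -> measured time in milliseconds,
    e.g. the values from one curve of a component timing plot.
    """
    t1 = times_ms[1]
    return {p: t1 / t for p, t in sorted(times_ms.items())}
\end{verbatim}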
Performance of the triangular solvers on this larger planning matrix is more promising than for the operations matrix. The respective graphs in figure 7.8 show that we obtained limited speedup when performing the triangular solutions in the last diagonal block. Both triangular solution algorithms suffered almost no load-imbalance overhead for this larger power systems network, in spite of the fact that we distributed the data to processors as a function of balanced computational load for factorization.
We have conducted similar detailed examinations of the performance of the algorithm for the three other power systems networks and have obtained similar results. We draw the following conclusions from this detailed examination of the parallel direct algorithm components: