
Empirical Results --- Sparse Choleski Solver

The performance of the block-diagonal-bordered Choleski algorithm was tested with the following maximum diagonal block sizes for the respective matrices:

Examples of each of these block-diagonal-bordered sparse matrices were presented in section 8.2 with load balancing for four processors. These orderings were used because they offered the best performance for each load-flow matrix. Performance data was collected for individual operations to examine the speedup of each distinct portion of the algorithm, and was collected in a manner that had no impact on the overall performance measures. Data was gathered while running the sparse block-diagonal-bordered solver on one to sixteen CM-5 node processors. The performance of the multi-processor algorithms is illustrated with graphs plotting relative speedup versus the number of processors; relative speedup is defined above in section 8.1.
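The speedup measure used throughout these graphs can be stated concisely. The following is an illustrative sketch, not the author's code; the timing values are hypothetical, not the measured CM-5 data:

```python
# Relative speedup is the single-processor run time divided by the
# p-processor run time for the same problem: S(p) = T(1) / T(p).

def relative_speedup(t_serial, t_parallel):
    """Relative speedup S(p) = T(1) / T(p)."""
    return t_serial / t_parallel

# Hypothetical wall-clock times (seconds) for 1 to 16 processors.
timings = {1: 1.00, 2: 0.62, 4: 0.41, 8: 0.38, 16: 0.37}

for p, t in timings.items():
    print(f"{p:2d} processors: speedup = {relative_speedup(timings[1], t):.2f}")
```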

Graphs of speedup calculated from empirical performance data are provided in figures 29 through 31 for the three matrices. Each figure has a family of three curves that show speedup for:

  1. Choleski factorization
  2. forward reduction and backward substitution
  3. factorization and a single forward reduction and backward substitution
In general, these speedup graphs illustrate that for the real power-system load-flow matrices from the Boeing-Harwell series and for the Niagara Mohawk data, speedup appears to be limited to approximately 2.7 regardless of the number of processors used to solve the problem. Careful analysis of the performance data reveals the following trends.
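One way to see why speedup saturates is an Amdahl-style bound: if a fraction of the work (here, the last diagonal block and border) does not scale with the processor count, speedup is capped at the reciprocal of that fraction. The sketch below uses a hypothetical serial fraction for illustration, not a value measured from the paper's data:

```python
# Amdahl-style bound on speedup when a serial fraction f of the work
# (e.g., the last diagonal block) does not scale with processor count.

def amdahl_speedup(f_serial, p):
    """Speedup with serial fraction f on p processors: 1 / (f + (1 - f)/p)."""
    return 1.0 / (f_serial + (1.0 - f_serial) / p)

f = 0.30  # hypothetical serial fraction
for p in (2, 4, 8, 16):
    print(f"p={p:2d}: bounded speedup = {amdahl_speedup(f, p):.2f}")
# As p grows, speedup approaches 1/f; a serial fraction of roughly a third
# would cap speedup near 3, consistent with the observed saturation.
```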

 


Figure 29: Speedup for BCSPWR09 Data --- 2, 4, and 8 processors

 


Figure 30: Speedup for BCSPWR10 Data --- 2, 4, 8, and 16 processors

 


Figure 31: Speedup for Niagara Mohawk Data --- 2, 4, 8, and 16 processors

For all three matrices, the performance data showed no significant difference between load balancing by the number of operations in factorization and load balancing by the number of operations in the triangular solution. The time to factor, forward reduce, or backward substitute the last diagonal block dominates the calculations for these matrices because of the low order of computational complexity in the diagonal block partitions of the matrix. These algorithms use pipelined communications that data-starve quickly at the small sizes of the last diagonal blocks. Orderings with larger values of the maximum diagonal block size yield larger last diagonal blocks; however, while better speedups are achieved, the actual run times are not improved.

The performance of the block-diagonal-bordered sparse Choleski solver on the BCSPWR10 matrix shows some improvement over its performance on the BCSPWR09 matrix. The additional operations in the BCSPWR10 matrix permit improved speedup, with efficiencies of 60% for four processors, although speedup remains nearly constant as additional processors are used. The Niagara Mohawk data provided a considerably larger matrix, although it required fewer calculations than the BCSPWR10 matrix and suffered from load imbalance even for four processors. The time to factor the mutually independent diagonal blocks showed that this portion of the calculation is embarrassingly parallel for the two matrices from the Boeing-Harwell series. For the Niagara Mohawk data, the time to factor the mutually independent blocks stays constant for four or more processors. As discussed above, attempts to select the maximum diagonal block size to minimize load imbalance increased the size of the last diagonal block so significantly that subsequent execution times were greater than for an ordering that suffered load imbalance.
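The distribution of the mutually independent diagonal blocks among processors can be sketched with a simple greedy scheme. This is an illustrative sketch with hypothetical per-block operation counts, not the paper's actual partitioning code or data:

```python
# Greedy load balancing of mutually independent diagonal blocks across
# processors by per-block operation count: repeatedly assign the largest
# remaining block to the least-loaded processor.
import heapq

def balance_blocks(block_ops, n_procs):
    """Assign blocks to the least-loaded processor; return per-processor loads."""
    loads = [(0, p) for p in range(n_procs)]
    heapq.heapify(loads)
    assignment = [[] for _ in range(n_procs)]
    for ops in sorted(block_ops, reverse=True):  # largest blocks first
        load, p = heapq.heappop(loads)
        assignment[p].append(ops)
        heapq.heappush(loads, (load + ops, p))
    return [sum(a) for a in assignment]

# Hypothetical operation counts for independent diagonal blocks.
ops = [900, 850, 400, 350, 300, 200, 150, 100]
loads = balance_blocks(ops, 4)
print("per-processor loads:", loads)
print("imbalance = max/mean = %.2f" % (max(loads) / (sum(ops) / 4)))
```

A single oversized block bounds the balance from below: no assignment can finish before the processor holding the largest block, which mirrors the Niagara Mohawk case where a few large blocks forced load imbalance even on four processors.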






David P. Koester
Sun Oct 22 15:40:25 EDT 1995