The performance of the block-diagonal-bordered Choleski algorithm was tested with the following maximum diagonal block size for the respective matrices:
Graphs of speedup calculated from empirical performance data are provided in figures 29 through 31 for the three matrices. Each figure has a family of three curves that show speedup for:
Figure 29: Speedup for BCSPWR09 Data (2, 4, and 8 processors)
Figure 30: Speedup for BCSPWR10 Data (2, 4, 8, and 16 processors)
Figure 31: Speedup for Niagara Mohawk Data (2, 4, 8, and 16 processors)
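As a reminder of how the plotted quantities are conventionally derived from timing data, the following minimal sketch computes relative speedup (one-processor time divided by p-processor time) and efficiency (speedup divided by p). The function names and the timings are hypothetical and are not the measured data behind the figures.

```python
def speedup(t_serial, t_parallel):
    """Relative speedup: one-processor time divided by p-processor time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical wall-clock times in seconds, keyed by processor count.
# These values are illustrative only and are NOT taken from the experiments.
timings = {1: 1.00, 2: 0.60, 4: 0.40, 8: 0.35}

for p in sorted(timings):
    s = speedup(timings[1], timings[p])
    e = efficiency(timings[1], timings[p], p)
    print(f"{p:2d} processors: speedup {s:.2f}, efficiency {e:.0%}")
```

A curve that flattens as processors are added, as in the hypothetical timings above, corresponds to falling efficiency even though speedup does not decrease.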
For all three matrices, the performance data showed no significant difference between load balancing on the number of operations in the factorization and load balancing on the number of operations in the triangular solution. The time to factor, forward reduce, or backward substitute the data in the last diagonal block dominates the calculations for these matrices, because the diagonal block partitions of the matrix have a low order of computational complexity. These algorithms use pipelined communications that quickly become data-starved when the last diagonal block is small. Other values of the maximum diagonal block size yield larger last diagonal blocks; however, while better speedups are achieved, the actual run times are not improved.
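The effect of a dominant last diagonal block can be illustrated with a simple Amdahl-style model. This is a deliberate simplification and not the analysis used in this work: it assumes the work in the independent diagonal blocks divides evenly among the processors, while the time spent in the last diagonal block does not shrink at all as processors are added. The function name and the time split are hypothetical.

```python
def modeled_speedup(parallel_work, last_block_time, processors):
    """Amdahl-style speedup estimate.

    parallel_work   -- one-processor time for the independent diagonal blocks
    last_block_time -- time in the last diagonal block, assumed (pessimistically)
                       not to decrease as processors are added
    """
    serial_time = parallel_work + last_block_time
    parallel_time = parallel_work / processors + last_block_time
    return serial_time / parallel_time


# Hypothetical split: 60% of one-processor time in independent blocks,
# 40% in the last diagonal block. Illustrative values only.
for p in (1, 2, 4, 8, 16):
    print(p, round(modeled_speedup(0.6, 0.4, p), 2))
```

Under these assumptions the speedup can never exceed the ratio of total time to last-block time, which is one way to see why a speedup curve would flatten once the last block dominates.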
The block-diagonal-bordered sparse Choleski solver performs somewhat better on the BCSPWR10 matrix than on the BCSPWR09 matrix. The additional operations in the BCSPWR10 matrix permit improvements in speedup and efficiencies of 60% for four processors, although speedup remains nearly constant as additional processors are used. The Niagara Mohawk data offered a considerably larger matrix, although the number of calculations was less than for the BCSPWR10 matrix, and this matrix suffered from load imbalance even for four processors. The time to factor the mutually independent diagonal blocks showed that this portion of the calculations is embarrassingly parallel for the two matrices from the Harwell-Boeing series. In the Niagara Mohawk data, the time to factor the mutually independent blocks stays constant for four or more processors. As discussed above, attempts at selecting the maximum size of the diagonal blocks to minimize the load imbalance caused the size of the last diagonal block to increase so much that the resulting execution times were greater than for a matrix ordering that suffered from load imbalance.
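To illustrate how load imbalance among the independent diagonal blocks can keep the factorization time constant as processors are added, the sketch below assigns blocks to processors by operation count. It uses a greedy largest-first heuristic, which is an assumption chosen for illustration and not necessarily the load-balancing scheme used in this work. The function name and the operation counts are hypothetical.

```python
import heapq


def assign_blocks(block_ops, processors):
    """Greedy largest-first assignment of diagonal blocks to processors.

    block_ops  -- operation count for each independent diagonal block
    processors -- number of processors
    Returns (assignment, loads): assignment[i] is the processor that owns
    block i, and loads[p] is the total operation count on processor p.
    """
    # Min-heap of (current load, processor id); the lightest processor pops first.
    heap = [(0, p) for p in range(processors)]
    heapq.heapify(heap)
    assignment = [None] * len(block_ops)
    # Place the largest blocks first, each on the currently lightest processor.
    for i in sorted(range(len(block_ops)), key=lambda i: -block_ops[i]):
        load, p = heapq.heappop(heap)
        assignment[i] = p
        heapq.heappush(heap, (load + block_ops[i], p))
    loads = [0] * processors
    for i, p in enumerate(assignment):
        loads[p] += block_ops[i]
    return assignment, loads


# Hypothetical operation counts with one dominant block (illustrative only).
ops = [100, 10, 10, 10, 10, 10]
for p in (2, 4, 8):
    _, loads = assign_blocks(ops, p)
    print(p, "processors: heaviest load =", max(loads))
```

In this hypothetical case the heaviest load equals the largest block for every processor count shown, so the time for this phase would not improve with more processors. A smaller maximum block size would remove the dominant block, but in a block-diagonal-bordered ordering that moves work into the last diagonal block, which is the trade-off described above.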