Performance of our block-diagonal-bordered Choleski solver will be analyzed with three separate power systems matrices: BCSPWR09, BCSPWR10, and the Niagara Mohawk load flow matrix.
A detailed ordering analysis has been performed on the BCSPWR09 data to illustrate the ability of the node-tearing ordering algorithm. To contrast the performance of this ordering technique, figure 17 illustrates the BCSPWR09 matrix after a minimum degree ordering alone, without node-tearing. Note that the matrix is most sparse in the upper left-hand corner and less sparse in the lower right-hand corner. When this matrix is factored, 2,168 zero values become non-zero. Original non-zero values are represented in this figure in black, fillin locations are represented in gray, and all remaining zero values are white. A bounding box has been placed around the sparse matrix. Our block-diagonal-bordered Choleski solver requires that the BCSPWR09 matrix be ordered into block-diagonal-bordered form with the workload distributed uniformly across the processors. A single input parameter, the maximum partition size, defines the shape of the matrix after ordering with the node-tearing algorithm. Examples of applying the node-tearing algorithm to the BCSPWR09 matrix are presented in figures 18, 19, and 20 for maximum diagonal block sizes of 32, 64, and 128 nodes respectively. The number of fillin for these block-diagonal-bordered matrices is 2,765, 3,248, and 3,838 respectively, with a substantial portion of the fillin occurring in the lower right-hand corner, the last diagonal block.
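As a rough illustration of the structure that node-tearing produces, the following sketch partitions an undirected network graph into diagonal blocks of bounded size plus a set of coupling equations (the lower border and last block). This is only a simplified sketch under our own naming conventions; the node-tearing algorithm actually used in this work selects the coupling nodes far more carefully to keep the last block small.

\begin{verbatim}
from collections import deque

def node_tearing_sketch(adj, max_block_size):
    """Simplified sketch of a node-tearing style partitioning.

    adj : dict mapping every node to the set of its neighbors
          (symmetric, i.e., the undirected network graph).
    Returns (blocks, coupling): node sets for the mutually independent
    diagonal blocks and for the coupling equations.
    """
    unassigned = set(adj)
    blocks = []
    while unassigned:
        # Grow a block by breadth-first search, bounded by max_block_size.
        seed = next(iter(unassigned))
        block, frontier = set(), deque([seed])
        while frontier and len(block) < max_block_size:
            n = frontier.popleft()
            if n in unassigned:
                unassigned.discard(n)
                block.add(n)
                frontier.extend(v for v in adj[n] if v in unassigned)
        blocks.append(block)

    # Any node adjacent to a different block is torn out into the
    # coupling equations; the remaining blocks are mutually independent.
    owner = {n: i for i, b in enumerate(blocks) for n in b}
    coupling = {n for n in adj if any(owner[v] != owner[n] for v in adj[n])}
    blocks = [b - coupling for b in blocks]
    return blocks, coupling
\end{verbatim}

Raising the maximum block size in such a scheme merges neighboring partitions and moves coupling nodes shared only by those partitions back into a diagonal block, which is consistent with the inverse relationship between maximum block size and last-block size discussed below.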
Figure 17: Minimum Degree Ordering --- BCSPWR09
Figure 18: Node-Tearing-Based Ordering --- BCSPWR09 --- Maximum Block Size 32
Figure 19: Node-Tearing-Based Ordering --- BCSPWR09 --- Maximum Block Size 64
Figure 20: Node-Tearing-Based Ordering --- BCSPWR09 --- Maximum Block Size 128
The general performance of node-tearing-based ordering depends on the maximum number of nodes permitted in a diagonal block and on the number of processors to which the mutually independent diagonal blocks will be distributed. The number of coupling equations, and hence the size of the last block, depends only on the maximum number of nodes in a diagonal block, as illustrated in figure 21. Note that the maximum size of the diagonal blocks is inversely related to the size of the last diagonal block. This is intuitive: as diagonal matrix blocks are permitted to grow larger, multiple smaller blocks can be incorporated into a single block. Not only are those smaller blocks consolidated into a single larger block, but any coupling equations that are unique to those network partitions can also be moved into the larger block. Another interesting aspect of the relationship between the maximum diagonal block size and the size of the last block is that the percentage of non-zeros and fillin in the last diagonal block increases significantly as the size of the last block decreases. This makes parallel dense Choleski factorization techniques even more attractive for the last diagonal block when the maximum diagonal block size is chosen to minimize the size of the last diagonal block.
It is desirable to minimize the size of the last block, but this can cause load imbalance as the number of processors increases. Load imbalance is defined as the difference between the maximum and minimum numbers of floating point operations per processor divided by the maximum number of floating point operations per processor; it gives the percentage of time that the processor with the least work is inactive. Families of curves illustrating load imbalance as a function of the number of processors are presented in figure 22. The load imbalance data in this figure are calculated only for the number of operations required to compute the Choleski factor of the matrix. For all orderings with different diagonal block sizes there is perfect load balancing for four processors; however, as the number of processors increases, so does the load imbalance.
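Stated as a formula, with $W_i$ denoting the number of floating point operations assigned to processor $i$ (a symbol introduced here only to restate the definition above):

\[
\mbox{load imbalance} \;=\; \frac{\max_i W_i \,-\, \min_i W_i}{\max_i W_i} .
\]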
Figure 21: Node-Tearing --- Nodes in Last Block for BCSPWR09
Figure 22: Node-Tearing --- Load Imbalance for Factoring BCSPWR09
For the BCSPWR09 matrix, we have selected a maximum diagonal block size of 128 as the trade-off between load balance and the minimum size of the last block. Figure 23 depicts a reordered version of the matrix presented in figure 20, with the diagonal blocks assigned to each processor reordered to make those blocks contiguous. This reordering is performed by simple row and column exchanges and has no effect on either the positive definite nature of the matrix or the number of fillin; it is merely a modification of the permutation matrix used in equation 5. In this figure, the measure of merit for the pigeon-hole load balancing algorithm is the number of factorization (multiply and addition) operations. In contrast, figure 24 depicts the same matrix after node-tearing, except that load balancing has been performed as a function of the number of multiply and addition operations encountered in forward reduction and backward substitution. Note the obvious differences in the block assignments to processors P2 and P4. In figures 23 and 24, the assignments of matrix partitions to processors have been denoted. Figure 25 presents plots of load imbalance versus number of processors for the BCSPWR09 matrix ordered with a maximum of 128 nodes per diagonal block. Two load imbalance curves are presented --- one for load balancing on the number of calculations in the factorization step, and a second for load balancing on the number of calculations in the triangular solution phase. The number of processors ranges from two through 32, and it is clear that the load imbalance becomes significant for more than eight processors. The graph also shows that the choice of load-balancing criterion produces a significant difference in load imbalance only at eight processors; for other numbers of processors the two curves are quite similar.
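The details of the pigeon-hole load balancing algorithm are not repeated here; the sketch below shows one common greedy way to balance mutually independent blocks on a chosen measure of merit (multiply and addition counts for factorization, or for the triangular solutions). The function names and the example block workloads are ours, and the actual pigeon-hole algorithm may differ in detail.

\begin{verbatim}
def assign_blocks(block_ops, num_procs):
    """Greedily assign independent diagonal blocks to processors.

    block_ops : list of (block_id, operation_count) pairs, where the
                operation count is the chosen measure of merit.
    Returns (assignment, loads): block ids per processor and the total
    operation count assigned to each processor.
    """
    loads = [0] * num_procs
    assignment = {p: [] for p in range(num_procs)}
    # Place the largest blocks first; each block goes to the processor
    # that currently has the least work.
    for block_id, ops in sorted(block_ops, key=lambda b: b[1], reverse=True):
        p = min(range(num_procs), key=lambda q: loads[q])
        assignment[p].append(block_id)
        loads[p] += ops
    return assignment, loads

def load_imbalance(loads):
    """(max - min) / max, as defined earlier."""
    return (max(loads) - min(loads)) / max(loads)

# Hypothetical example: four blocks balanced onto two processors.
assignment, loads = assign_blocks(
    [("B1", 900), ("B2", 400), ("B3", 350), ("B4", 300)], 2)
\end{verbatim}

Balancing on factorization counts and balancing on triangular-solution counts simply change the operation counts fed to such a routine, which is why the block assignments in figures 23 and 24 can differ.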
Figure 23: BCSPWR09 --- Load Balancing on Factoring --- 4 Processors
Figure 24: BCSPWR09 --- Load Balancing on Triangular Solutions --- 4 Processors
Figure 25: Load Imbalance for Factoring BCSPWR09
Similar ordering analyses have been performed for the BCSPWR10 and Niagara Mohawk load flow matrices. Examples of these matrices after applying the node-tearing algorithm, ordering into block-diagonal-bordered form, and load balancing are presented in figures 26 through 28. Figure 26 illustrates the block-diagonal-bordered form achievable when ordering the BCSPWR10 matrix with a maximum block size of 256, while figures 27 and 28 illustrate the block-diagonal-bordered forms achievable when ordering the Niagara Mohawk data with maximum block sizes of 256 and 512 nodes respectively. The numbers of graph nodes, or diagonal elements, in the BCSPWR10 and Niagara Mohawk matrices are 5,300 and 9,430 respectively. The BCSPWR10 matrix in figure 26 has 476 equations in the lower border and last block, while the Niagara Mohawk matrices in figures 27 and 28 have 402 and 313 coupling equations respectively. These figures illustrate load balancing for four processors based on the number of calculations required to perform the factorization. Although the Niagara Mohawk power system network graph is similar to the BCSPWR09 power system graph in that it can be readily ordered into block-diagonal-bordered form, there are some interesting and substantial differences. Summary statistics for the three matrices and various maximum diagonal block sizes are presented in table 1.
Figure 26: BCSPWR10 --- Load Balancing on Factoring --- Maximum Block Size 256 --- 4 Processors
Figure 27: Niagara Mohawk Data --- Load Balancing on Factoring --- Maximum Block Size 256 --- 4 Processors
Figure 28: Niagara Mohawk Data --- Load Balancing on Factoring --- Maximum Block Size 512 --- 4 Processors
The BCSPWR10 matrix has substantially more edges per node than the BCSPWR09 matrix, which results in a significantly greater computational complexity for the factorization of the matrix, while the computational complexity of the triangular solution is similar to that of the BCSPWR09 matrix. Greater computational complexity means more work, and if that work can be distributed uniformly to all processors, then there is the potential for greater parallel implementation efficiency. Our goal is to develop scalable algorithms for Choleski factorization; however, parallelism in sparse Choleski algorithms is heavily dependent on the available parallelism in the actual sparse matrix. In section 8.3, it will be shown that the additional computations in BCSPWR10 (when compared to BCSPWR09) permit reasonable efficiencies when factoring this matrix with four processors. The BCSPWR09 solver, in contrast, simply does not perform enough calculations to overcome the overhead of the parallel implementation.
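For reference, the standard definitions of speedup and parallel efficiency (the symbols below are ours, not the paper's) make this argument explicit: with $p$ processors, serial execution time $T_1$, and parallel execution time $T_p$,

\[
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p},
\]

so a fixed amount of per-processor overhead degrades $E_p$ less when the useful work is larger, as it is when factoring BCSPWR10 rather than BCSPWR09.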
The matrix from Niagara Mohawk, labeled NiMo in table 1, has a computational complexity similar to that of BCSPWR09; however, it has several times as many diagonal elements and several times as many non-zero elements and fillin. Careful examination of figures 27 and 28 shows that, for four processors, the load balancing has placed a single diagonal block alone on one of the four processors. This block has only 247 nodes, yet it accounts for 62% of the factorization operations. It is possible to divide that diagonal matrix block by decreasing the maximum size of the diagonal blocks. Unfortunately, when the maximum size of a diagonal block is reduced from 247 to 128, two situations arise that generate less than optimal solutions. First, the size of the last block increases by 77%, which increases the number of computations in the last block by over 550%. While the last diagonal block is solved in parallel, the pipelined parallel dense Choleski solver used is not sufficiently scalable to overcome such a large increase in the number of operations by applying additional processors. Second, even when the size of the diagonal blocks is limited so that the block in question is forced to be divided into separate matrices, the number of operations in a single block remains 31% of the total number of operations in the diagonal blocks and borders, so load imbalance would still occur for more than three processors.
Further examination of table 1 shows that the computational complexity of the forward reduction and backward substitution phases is similar in magnitude regardless of the choice of matrix. It is difficult to obtain good speedup on large multi-processors for algorithms with strong precedence relationships in the calculations, especially when the computational complexity is so close to unity.