
Empirical Results --- Ordering

Performance of our block-diagonal-bordered Choleski solver will be analyzed with three separate power system matrices: BCSPWR09, BCSPWR10, and load flow data obtained from the Niagara Mohawk Power Corporation.

Matrices BCSPWR09 and BCSPWR10 are from the Boeing-Harwell series and represent real electrical power system networks from the Western and Eastern US respectively. These matrices have an edge-to-node ratio of approximately 2.5. The data obtained from the Niagara Mohawk Power Corporation is substantially sparser, with an edge-to-node ratio of only 1.5. While this data has more nodes than either of the Boeing-Harwell matrices, its greater sparsity results in fewer calculations than for the BCSPWR10 matrix.
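For reference, the edge-to-node ratios quoted above can be recovered directly from the sparsity pattern. The short Python sketch below assumes a local copy of the matrix in MatrixMarket form (the file name bcspwr09.mtx is illustrative) and that every diagonal entry is explicitly stored.

    from scipy.io import mmread

    A = mmread("bcspwr09.mtx").tocsr()   # hypothetical local copy of the matrix
    n = A.shape[0]                       # graph nodes = diagonal elements
    edges = (A.nnz - n) // 2             # each off-diagonal entry appears twice
    print(f"nodes = {n}, edges = {edges}, edge-to-node ratio = {edges / n:.2f}")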

A detailed ordering analysis has been performed on the BCSPWR09 data to illustrate the capabilities of the node-tearing ordering algorithm. As a basis for comparison, figure 17 illustrates a minimum degree ordering of the BCSPWR09 matrix without node-tearing. Note that the matrix is most sparse in the upper left-hand corner and least sparse in the lower right-hand corner. When factoring this matrix, the number of zero values that become non-zero is 2,168. Original non-zero values are shown in black, fillin locations in gray, and all remaining zero values in white; a bounding box has been placed around the sparse matrix. Our block-diagonal-bordered Choleski solver requires that the BCSPWR09 matrix be ordered into block-diagonal-bordered form with a uniformly distributed workload at each of the processors. A single input parameter, the maximum partition size, defines the shape of the matrix after ordering with the node-tearing algorithm. Examples of applying the node-tearing algorithm to the BCSPWR09 matrix are presented in figures 18, 19, and 20 for maximum diagonal block sizes of 32, 64, and 128 nodes respectively. The amount of fillin for these block-diagonal-bordered matrices is 2,765, 3,248, and 3,838 respectively, with a substantial portion of the fillin occurring in the lower right-hand corner, the last diagonal block.
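As an illustration of where those fillin counts come from (not the ordering or factorization code used in this work), the following Python sketch counts fillin for a given elimination order by simulating elimination on the graph of the matrix; each fill entry is counted once per symmetric pair.

    from scipy.sparse import csr_matrix

    def count_fillin(A, order):
        """Count zero entries that become non-zero when eliminating nodes in `order`."""
        A = csr_matrix(A)
        n = A.shape[0]
        # adjacency sets of the elimination graph (symmetric pattern, no self loops)
        adj = [set(A.indices[A.indptr[i]:A.indptr[i + 1]]) - {i} for i in range(n)]
        eliminated, fill = set(), 0
        for v in order:
            nbrs = [u for u in adj[v] if u not in eliminated]
            # eliminating v makes its remaining neighbours pairwise adjacent
            for i, a in enumerate(nbrs):
                for b in nbrs[i + 1:]:
                    if b not in adj[a]:
                        adj[a].add(b)
                        adj[b].add(a)
                        fill += 1
            eliminated.add(v)
        return fill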

 


Figure 17: Minimum Degree Ordering --- BCSPWR09

Figure 18: Node-Tearing-Based Ordering --- BCSPWR09 --- Maximum Block Size 32

Figure 19: Node-Tearing-Based Ordering --- BCSPWR09 --- Maximum Block Size 64

Figure 20: Node-Tearing-Based Ordering --- BCSPWR09 --- Maximum Block Size 128

The general performance of node-tearing-based ordering depends on the maximum number of nodes permitted in a diagonal block and on the intended number of processors to which the mutually independent diagonal blocks will be distributed. The number of coupling equations, and hence the size of the last block, depends only on the maximum number of nodes in a diagonal block, as illustrated in figure 21. Note that the maximum size of the diagonal blocks is inversely related to the size of the last diagonal block. This is intuitive: as diagonal matrix blocks are permitted to grow larger, multiple smaller blocks can be incorporated into a single block. Not only are the smaller blocks consolidated into a single larger block, but any elements in the coupling equations that are unique to those network partitions can also be moved into the larger block. Another interesting aspect of the relationship between the maximum diagonal block size and the size of the last block is that the percentage of non-zeros and fillin in the last diagonal block increases significantly as the size of the last block decreases. This makes the use of parallel dense Choleski factorization techniques even more desirable for the last diagonal block when the maximum diagonal block size is chosen to minimize the size of the last diagonal block.
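For illustration only, the Python skeleton below sketches the basic idea behind such an ordering under simplifying assumptions: connected partitions are grown greedily up to the maximum block size, and a deliberately crude choice of coupling nodes (every endpoint of an edge joining two partitions) is made. It is not the node-tearing implementation used to produce these results.

    from collections import deque

    def tear(adj, max_size):
        """adj: list of neighbour sets, one per node.
        Returns (partitions, coupling): connected partitions of at most max_size
        nodes that are mutually independent once the coupling nodes are removed."""
        n = len(adj)
        part_of = [None] * n
        for seed in range(n):
            if part_of[seed] is not None:
                continue
            size, queue = 0, deque([seed])
            while queue and size < max_size:
                v = queue.popleft()
                if part_of[v] is not None:
                    continue
                part_of[v], size = seed, size + 1
                queue.extend(u for u in adj[v] if part_of[u] is None)
        # crude separator: every endpoint of an edge joining two different
        # partitions becomes a coupling equation (a real node-tearing ordering
        # cuts far fewer nodes than this)
        coupling = {v for v in range(n)
                    if any(part_of[u] != part_of[v] for u in adj[v])}
        groups = {}
        for v in range(n):
            if v not in coupling:
                groups.setdefault(part_of[v], []).append(v)
        return list(groups.values()), sorted(coupling)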

It is desirable to minimize the size of the last block, but doing so can cause load imbalance as the number of processors increases. Load imbalance is defined as the difference between the maximum and minimum numbers of floating point operations per processor divided by the maximum number of floating point operations per processor, and it indicates the percentage of time that the processor with the least amount of work is idle. Families of curves illustrating load imbalance as a function of the number of processors are presented in figure 22. The load imbalance data in this figure are calculated only from the number of operations required to compute the Choleski factor of the matrix. For all orderings with different maximum diagonal block sizes there is perfect load balancing for four processors; however, as the number of processors increases, so does load imbalance.
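In symbols, if W_p denotes the number of floating point operations assigned to processor p, this definition reads

\[
  \mathrm{load\ imbalance} \;=\; \frac{\max_{p} W_{p} \;-\; \min_{p} W_{p}}{\max_{p} W_{p}},
\]

so a value of zero indicates perfectly uniform work, and a value near one indicates that the least-loaded processor is idle for nearly all of the busiest processor's time.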

 


Figure 21: Node-Tearing --- Nodes in Last Block for BCSPWR09

Figure 22: Node-Tearing --- Load Imbalance for Factoring BCSPWR09

For the BCSPWR09 matrix, we have selected a maximum diagonal block size of 128 as the trade-off between load balance and the minimum size of the last block. Figure 23 depicts a reordered version of the matrix presented in figure 20, in which the diagonal blocks assigned to each processor have been made contiguous. This reordering is performed by simple row and column exchanges and has no effect on either the positive definite nature of the matrix or on the amount of fillin; it is just a modification to the permutation matrix used in equation 5. In this figure, the measure of merit for the pigeon-hole load balancing algorithm is the number of factorization (multiply and addition) operations. In contrast, figure 24 depicts the same matrix after node-tearing, except that load balancing has been performed as a function of the number of multiply and addition operations encountered in forward reduction and backward substitution. Note the obvious differences in the block assignments to processors P2 and P4. In figures 23 and 24, the assignments of matrix partitions to processors have been denoted. Figure 25 presents plots of load imbalance versus the number of processors for the BCSPWR09 matrix ordered with a maximum of 128 nodes per diagonal block. Two load imbalance curves are presented: one for load balancing on the number of calculations in the factorization step, and a second for load balancing on the number of calculations in the triangular solution phase. The number of processors ranges from two through 32, and it is clear that load imbalance becomes significant for more than eight processors. The graph also shows that a significant difference between the two balancing criteria appears only at eight processors; for other numbers of processors the two load imbalance curves are quite similar.
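The pigeon-hole load balancing referred to above can be illustrated by a simple greedy scheme: each mutually independent diagonal block carries an operation count under the chosen measure of merit, and blocks are assigned, largest first, to the currently least-loaded processor. The Python sketch below is an illustration of this idea only; the exact assignment rule used to produce the figures may differ.

    import heapq

    def pigeon_hole(block_ops, n_procs):
        """Greedily assign diagonal blocks to processors, largest block first.
        block_ops: operation count per independent diagonal block (the measure
        of merit, e.g. factorization or triangular-solution operations)."""
        heap = [(0, p) for p in range(n_procs)]      # (current load, processor)
        heapq.heapify(heap)
        assignment = [None] * len(block_ops)
        for b in sorted(range(len(block_ops)), key=lambda i: -block_ops[i]):
            load, p = heapq.heappop(heap)            # least-loaded processor
            assignment[b] = p
            heapq.heappush(heap, (load + block_ops[b], p))
        loads = [0] * n_procs
        for b, p in enumerate(assignment):
            loads[p] += block_ops[b]
        return assignment, loads

    # load imbalance as defined above, for an illustrative set of block sizes
    assignment, loads = pigeon_hole([900, 750, 400, 380, 220, 150], n_procs=4)
    imbalance = (max(loads) - min(loads)) / max(loads)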

 


Figure 23: BCSPWR09 --- Load Balancing on Factoring --- 4 Processors

Figure 24: BCSPWR09 --- Load Balancing on Triangular Solutions --- 4 Processors

Figure 25: Load Imbalance for Factoring BCSPWR09

Similar ordering analyses have been performed for the BCSPWR10 and Niagara Mohawk load flow matrices. Examples of these matrices after applying the node-tearing algorithm, ordering them into block-diagonal-bordered form, and load balancing are presented in figures 26 through 28. Figure 26 illustrates the block-diagonal-bordered form achievable when ordering the BCSPWR10 matrix with a maximum block size of 256, while figures 27 and 28 illustrate the block-diagonal-bordered forms achievable when ordering the Niagara Mohawk data for maximum block sizes of 256 and 512 nodes respectively. The numbers of graph nodes, or diagonal elements, in the BCSPWR10 and Niagara Mohawk matrices are 5,300 and 9,430 respectively. The BCSPWR10 matrix in figure 26 has 476 equations in the lower border and last block, while the Niagara Mohawk matrices in figures 27 and 28 have 402 and 313 coupling equations respectively. These figures illustrate load balancing for four processors based on the number of calculations required to perform the factorization. While the Niagara Mohawk power system network graph is similar to the BCSPWR09 power system graph in that it can be readily ordered into block-diagonal-bordered form, there are some interesting and substantial differences. Summary statistics are presented in table 1 for the three matrices and various maximum diagonal block sizes.

 


Figure 26: BCSPWR10 --- Load Balancing on Factoring --- Maximum Block Size 256 --- 4 Processors

Figure 27: Niagara Mohawk Data --- Load Balancing on Factoring --- Maximum Block Size 256 --- 4 Processors

Figure 28: Niagara Mohawk Data --- Load Balancing on Factoring --- Maximum Block Size 512 --- 4 Processors

 
Table 1: Summary Statistics  

The BCSPWR10 matrix has substantially more edges per node than the BCSPWR09 matrix, which results in a significantly greater computational complexity for the factorization of the matrix, while the computational complexity of the triangular solution is similar to that of the BCSPWR09 matrix. Greater computational complexity means more work, and if that work can be distributed uniformly to all processors, then there is the potential for greater parallel efficiency. Our goal is to develop scalable algorithms for Choleski factorization; however, parallelism in sparse Choleski algorithms is heavily dependent on the available parallelism in the actual sparse matrix. In section 8.3, it will be shown that the additional computations in BCSPWR10 (when compared to BCSPWR09) permit reasonable efficiencies when factoring this matrix with four processors, while the BCSPWR09 matrix simply does not provide enough calculations for the solver to achieve comparable efficiencies.

The matrix from Niagara Mohawk, labeled NiMo in table 1, has a computational complexity similar to that of BCSPWR09; however, it has substantially more diagonal elements, non-zero elements, and fillin. Careful examination of figures 27 and 28 shows that, for four processors, the load balancing has placed a single diagonal block alone on one of the four processors. This block has only 247 nodes, yet it accounts for 62% of the factorization operations. It is possible to subdivide that diagonal block by decreasing the maximum size of the diagonal blocks. Unfortunately, when the maximum size of a diagonal block is reduced from 247 to 128, two situations arise that yield less than optimal solutions. First, the size of the last block increases by 77%, which increases the number of computations in the last block by over 550%. While the last diagonal block is solved in parallel, the pipelined parallel dense Choleski solver used here is not sufficiently scalable to overcome such a large increase in the number of operations by applying additional processors. Second, even when the maximum diagonal block size is reduced enough to force the block in question to be divided into separate blocks, the number of operations in a single block remains 31% of the total number of operations in the diagonal blocks and borders, so load imbalance would occur for more than three processors.

Further examination of table 1 shows that the computational complexity of the forward reduction and backward substitution phases is similar in magnitude regardless of the choice of matrix. It is difficult to obtain good speedup on large multi-processors for algorithms with strong precedence in the calculations, especially when the computational complexity is so close to unity.
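To make that precedence concrete: in a forward reduction L y = b, the unknown y[i] cannot be computed until every y[j] with j < i and a non-zero L[i,j] is already known. The Python sketch below (an illustration only, not the parallel implementation analyzed here) shows this row-by-row data dependence.

    import numpy as np
    from scipy.sparse import csr_matrix

    def forward_reduce(L, b):
        """Solve L y = b for lower-triangular sparse L, row by row."""
        L = csr_matrix(L)
        diag = L.diagonal()
        y = np.asarray(b, dtype=float).copy()
        for i in range(L.shape[0]):              # strict order over rows
            for idx in range(L.indptr[i], L.indptr[i + 1]):
                j = L.indices[idx]
                if j < i:
                    y[i] -= L.data[idx] * y[j]   # depends on already-computed y[j]
            y[i] /= diag[i]                      # only now is y[i] available
        return y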





