
Performance Analysis

As an introduction to the performance of the parallel Gauss-Seidel algorithm, we present a pair of graphs that plot relative speedup and relative efficiency versus the number of processors. Figure 8 plots the best speedup and efficiency measured for each of the power systems matrices on 2, 4, 8, 16, and 32 processors. These graphs show that performance for the EPRI-6K data set is the best of the three data sets examined. Speedup reaches a maximum of 11.6 on 32 processors, and speedups greater than 10.0 were measured on 16 processors. This yields a relative efficiency of 63% for 16 processors and 36% for 32 processors.
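For reference, relative speedup and relative efficiency throughout this section follow the usual definitions (the symbols S_p and E_p below are shorthand introduced here, not notation taken from the figures):

    S_p = T_1 / T_p,        E_p = S_p / p,

where T_p is the wall-clock time on p processors and T_1 is the time for the same code on a single processor. For example, the measured speedup of 11.6 on 32 processors corresponds to E_32 = 11.6 / 32, or roughly the 36% efficiency quoted above.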

 
Figure 8: Relative Speedup and Efficiency --- 2, 4, 8, 16, and 32 processors  

Relative speedups for the BCSPWR09 and BCSPWR10 matrices are less than for the EPRI-6K matrix, but each has speedup in excess of 7.0 for 16 processors. The reason for reduced performance with these matrices at larger numbers of processors is the size of the last block after ordering. For both the BCSPWR09 and BCSPWR10 matrices, the last diagonal block requires approximately 5% of the total calculations, while the last block of the EPRI-6K matrix can be ordered so that only 1% of all calculations occur there. As the number of processors increases, communications overhead becomes a significant part of the overall processing time, because values in the last diagonal block must be broadcast to other processors before processing can proceed to the next color. Even though the second ordering phase is able to three-color the last diagonal blocks, communications overhead overwhelms the processing time for larger numbers of processors and limits speedup in this portion of the calculations. There are insufficient parallel operations when solving for the unknowns in the diagonal blocks of these matrices to offset the effect of the nearly sequential last block. The effect of Amdahl's law is visible for larger numbers of processors because one portion of the algorithm is essentially sequential.
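This behavior follows directly from Amdahl's bound. If a fraction s of the work remains effectively sequential in the last diagonal block, the achievable speedup on p processors is limited to roughly

    S_p <= 1 / (s + (1 - s) / p).

With s = 0.05, as measured for BCSPWR09 and BCSPWR10, the bound at 32 processors is about 12.5; with s = 0.01, as for the EPRI-6K ordering, it rises to about 24. The bound ignores the communications costs discussed above, so it is optimistic, but it shows why even a few percent of nearly sequential work caps speedup at larger numbers of processors.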

The BCSPWR09 matrix suffers an additional problem: the ordering phase was unable to balance the workload effectively in the portion of the algorithm that processes all but the last block. This matrix is the smallest examined, and there is insufficient available parallelism in the matrix to support 16 or more processors.

A detailed examination of relative speedup and relative efficiency is presented in figure 9 for the EPRI-6K data. This figure contains two graphs, one for relative speedup and one for relative efficiency, each with a family of four curves corresponding to the four maximum matrix partition sizes used in the node-tearing algorithm. The maximum partition sizes used when preprocessing these data are 128, 192, 256, and 320 nodes. The family of speedup curves clearly illustrates the effect of load imbalance for some matrix orderings. For all four orderings, speedup is nearly equal for 2 through 16 processors; however, the values for relative speedup diverge at 32 processors.

 
Figure 9: Relative Speedup and Efficiency for EPRI-6K Data --- 2, 4, 8, 16, and 32 processors  

We can look further into the cause of the disparity in the relative speedup values for the EPRI-6K data by examining the performance of each of the four distinct sections of the parallel algorithm. Figure 10 contains four graphs, each with a family of four curves plotting processing time in milliseconds versus the number of processors for each of the four maximum partition sizes used in the node-tearing algorithm: 128, 192, 256, and 320 nodes. These graphs are log-log scaled, so with perfect speedup the processing times would fall on a straight line, dropping by a constant amount with each doubling of the number of processors. One or more curves on each of the performance graphs for the diagonal blocks and lower border, for updating the last diagonal block, and for the convergence checks illustrate nearly perfect speedup with as many as 32 processors. Unfortunately, solving for the unknowns in the last block does not show the same parallel performance.
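To make the first of these algorithm sections concrete, the following sketch shows one Gauss-Seidel sweep over a single diagonal block stored in compressed sparse row (CSR) form. The storage layout and the function name are assumptions introduced for illustration; the handling of border terms and the communication that follows each color are omitted, and this is not the paper's implementation.

/* One Gauss-Seidel sweep over a single diagonal block stored in
 * compressed sparse row (CSR) form.  Illustrative sketch only; the
 * data layout and names are assumptions, not the paper's code.
 *   n        : number of rows in the block
 *   row_ptr  : row i occupies entries row_ptr[i] .. row_ptr[i+1]-1
 *   col_idx  : column index of each nonzero
 *   val      : value of each nonzero
 *   b, x     : right-hand side and current iterate for this block   */
void gauss_seidel_sweep(int n, const int *row_ptr, const int *col_idx,
                        const double *val, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double sum  = b[i];
        double diag = 1.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            int j = col_idx[k];
            if (j == i)
                diag = val[k];          /* remember the diagonal entry    */
            else
                sum -= val[k] * x[j];   /* x[j] for j < i already updated */
        }
        x[i] = sum / diag;              /* in-place update: Gauss-Seidel  */
    }
}

Each processor performs such sweeps independently over its assigned diagonal blocks, which is why this portion of the algorithm scales well.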

 
Figure 10: Timings for Algorithm Components --- EPRI-6K Data --- 2, 4, 8, 16, and 32 processors  

The performance graph for the diagonal blocks and lower border clearly illustrates the causes of the load imbalance observed in the relative speedup graph in figure 9. For some matrix orderings, load balancing is not able to divide the work evenly for larger numbers of processors. This always occurs for larger values of the maximum partition size, the maximum size of a block permitted when ordering the matrix. Nevertheless, when ordering a matrix for sixteen or more processors, selecting a small maximum partition size provides good speedup for larger numbers of processors. The performance curves presented in figure 8 show the best performance observed across the four matrix orderings.
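The imbalance can be illustrated with a simple assignment heuristic: take the diagonal blocks in order of decreasing work and give each to the currently least-loaded processor. This is a standard greedy heuristic shown only as a sketch; it is not necessarily the assignment strategy used in the preprocessing phase.

#include <float.h>

/* Greedy block-to-processor assignment: blocks are taken in order of
 * decreasing work and each goes to the currently least-loaded
 * processor.  Shown only to illustrate the balancing problem; the
 * paper's load-balancing step is not reproduced here.
 *   work[]  : estimated operation count per block, sorted descending
 *   owner[] : output, processor assigned to each block               */
void assign_blocks(const double *work, int nblocks, int nproc, int *owner)
{
    double load[64] = {0.0};            /* assumes nproc <= 64 */

    for (int b = 0; b < nblocks; b++) {
        int    best = 0;
        double min_load = DBL_MAX;
        for (int p = 0; p < nproc; p++)  /* find the least-loaded processor */
            if (load[p] < min_load) {
                min_load = load[p];
                best = p;
            }
        owner[b] = best;
        load[best] += work[b];
    }
}

With a large maximum partition size, a handful of expensive blocks dominates the work array and no assignment can equalize per-processor loads once the number of processors approaches the number of blocks; with a smaller maximum partition size there are many blocks of modest cost, and the heuristic balances them well.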

Updating the last block, which requires sparse matrix-vector products followed by irregular communications, yields good performance even for 32 processors. The time to perform these updates is correlated with the size of the last diagonal block, which is inversely related to the maximum partition size. This relationship is intuitive: as the maximum partition size increases, multiple smaller blocks can be incorporated into a single block. Not only can two smaller blocks be consolidated into a single block, but any elements of the coupling equations that are unique to those network partitions can also be moved into the larger block.
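A minimal sketch of the local portion of this update is given below: each processor forms its contribution to the last-block right-hand side as a sparse matrix-vector product over the lower-border rows it owns. The CSR layout and names are assumptions for illustration, and the irregular communication that merges the contributions across processors is not shown.

/* Contribution of one processor's lower-border rows to the last-block
 * update, formed as a sparse matrix-vector product (CSR layout as in
 * the earlier sketch).  Illustrative only.                             */
void border_update(int nrows, const int *row_ptr, const int *col_idx,
                   const double *val,   /* border rows owned locally    */
                   const double *x,     /* solved diagonal-block values */
                   double *contrib)     /* output: local product        */
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        contrib[i] = sum;               /* later merged across processors */
    }
}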

The performance graph for convergence checking illustrates that the load balancing step does not assign equal numbers of rows to all processors; the number of rows on a processor varies with the load balancing. While the family of curves on this graph is more erratic than the curves for the diagonal blocks and lower border or for updating the last diagonal block, performance generally continues to improve, with near-perfect parallelism even for 32 processors.
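The convergence check itself parallelizes naturally: each processor reduces over its assigned rows and a single global reduction combines the partial results. The sketch below expresses the reduction as an MPI call for concreteness; MPI is an assumption here, not the message-passing layer used for the measurements reported in this section.

#include <math.h>
#include <mpi.h>

/* Distributed convergence test: each processor reduces over its own
 * rows, then a single global reduction combines the results.  MPI is
 * used only to illustrate the communication pattern.                  */
int converged(const double *x_new, const double *x_old,
              int nlocal, double tol, MPI_Comm comm)
{
    double local_max = 0.0, global_max;

    for (int i = 0; i < nlocal; i++) {
        double d = fabs(x_new[i] - x_old[i]);
        if (d > local_max)
            local_max = d;              /* max-norm of the local change   */
    }
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, comm);
    return global_max < tol;            /* same answer on every processor */
}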

Information on relative performance as a function of the maximum partition size and the number of processors would be required when implementing this parallel algorithm in a load-flow analysis application. To minimize the effects of data movement, the entire process of calculating the Jacobian when solving the systems of non-linear equations would have to respect the processor/data assignments made by the sparse linear solver. The time to solve each instance of the linear equations generated from the Jacobian is so small that all data redistribution must be eliminated; otherwise, the benefits of parallel speedup will be lost in the application.

Performance of this parallel Gauss-Seidel linear solver depends on the performance of the matrix preprocessing phase. We must reiterate that all available parallelism in this work is a result of ordering the matrix and identifying relationships in the connectivity pattern within the structure of the matrix. Power systems load-flow matrices are among the sparsest irregular matrices encountered. For the EPRI-6K data, the mode of the histogram of edges per node is only two; in other words, the most frequent number of edges at a node is two. 84.4% of the nodes in the EPRI-6K data have three or fewer edges, and for the BCSPWR10 matrix, 71% of the nodes have three or fewer edges. Consequently, power systems matrices pose some of the greatest challenges in producing efficient parallel sparse matrix algorithms.
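These statistics are easy to reproduce from the adjacency structure. The sketch below tallies the off-diagonal edges per node from a CSR representation and reports the mode of the histogram and the fraction of nodes with three or fewer edges; the layout and names are again assumptions for illustration.

#include <stdio.h>

/* Degree statistics for a symmetric sparse matrix in CSR form:
 * histogram of off-diagonal edges per node, its mode, and the
 * fraction of nodes with three or fewer edges.                 */
void degree_stats(int n, const int *row_ptr, const int *col_idx)
{
    int hist[64] = {0};                 /* assumes every degree < 64   */
    int low = 0;                        /* nodes with 3 or fewer edges */

    for (int i = 0; i < n; i++) {
        int deg = 0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            if (col_idx[k] != i)
                deg++;                  /* count off-diagonal entries  */
        hist[deg]++;
        if (deg <= 3)
            low++;
    }

    int mode = 0;
    for (int d = 1; d < 64; d++)
        if (hist[d] > hist[mode])
            mode = d;

    printf("mode of edges per node: %d\n", mode);
    printf("nodes with <= 3 edges:  %.1f%%\n", 100.0 * low / n);
}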

In figures 11 and 12, we present two orderings of the EPRI-6K data with maximum partition sizes of 128 and 256 nodes, respectively. Non-zero entries in the matrix are represented as dots, and the matrices are delimited by a bounding box. Each of these figures contains three sub-figures: the ordered sparse matrix and two enlargements of the last block, before and after multi-coloring. Both matrices have been partitioned into block-diagonal-bordered form and load-balanced for eight processors. The numbers of nodes in the last diagonal blocks are 153 and 120, respectively, while the numbers of edges in this part of the matrix are 34 and 22, respectively. The graph multi-coloring algorithm is able to color these portions of the matrices with three and two colors, respectively. These plots represent the adjacency structure of the network graphs and clearly illustrate the sparsity of these matrices.

 
Figure 11: Ordered EPRI-6K Matrix --- Maximum Partition Size of 128 Nodes

 
Figure 12: Ordered EPRI-6K Matrix --- Maximum Partition Size of 256 Nodes
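Color counts as small as those quoted for the last diagonal blocks can be reproduced with the textbook greedy coloring scheme sketched below, given how sparse these subgraphs are. The sketch is offered only as an illustration; the second ordering phase used in this work is not reproduced here.

/* Greedy coloring of the last diagonal block's graph (adjacency in CSR
 * form, as in the earlier sketches): each node takes the smallest color
 * not already used by one of its colored neighbors.  Illustrative only. */
int greedy_coloring(int n, const int *row_ptr, const int *col_idx, int *color)
{
    int ncolors = 0;

    for (int i = 0; i < n; i++)
        color[i] = -1;                  /* -1 marks an uncolored node */

    for (int i = 0; i < n; i++) {
        int used[64] = {0};             /* assumes far fewer than 64 colors */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            int j = col_idx[k];
            if (j != i && color[j] >= 0)
                used[color[j]] = 1;     /* neighbor's color is taken  */
        }
        int c = 0;
        while (used[c])
            c++;
        color[i] = c;                   /* smallest color not in use  */
        if (c + 1 > ncolors)
            ncolors = c + 1;
    }
    return ncolors;                     /* number of colors used      */
}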






David P. Koester
Sun Oct 22 15:29:26 EDT 1995