As an introduction to the performance of the parallel Gauss-Seidel algorithm, we present a pair of graphs that plot relative speedup and relative efficiency versus the number of processors. Figure 8 plots the best speedup and efficiency measured for each of the power systems matrices on 2, 4, 8, 16, and 32 processors. These graphs show that performance for the EPRI-6K data set is the best of the three data sets examined. Speedup reaches a maximum of 11.6 on 32 processors, and speedups greater than 10.0 were measured on 16 processors, yielding relative efficiencies of 63% for 16 processors and 36% for 32 processors.
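For reference, the relative speedup and relative efficiency plotted here follow the standard definitions with respect to the single-processor execution time of the parallel code, which are consistent with the figures quoted above (for example, 11.6/32 is approximately 36%):

\[
S_p \;=\; \frac{T_1}{T_p},
\qquad
E_p \;=\; \frac{S_p}{p} \;=\; \frac{T_1}{p\,T_p},
\]

where \(T_p\) denotes the measured execution time on \(p\) processors.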
Figure 8: Relative Speedup and Efficiency --- 2, 4, 8, 16, and 32 processors
Relative speedups for the BCSPWR09 and BCSPWR10 matrices are less than
for the EPRI-6K matrix, but each has speedup in excess of 7.0 for 16
processors. The reason for reduced performance with these matrices
for larger numbers of processors is the size of the last block after
ordering. For both the BCSPWR09 and BCSPWR10 matrices, the last
diagonal block requires approximately 5% of the total calculations
while the last block of the EPRI-6K matrix can be ordered so that only
1% of all calculations occur there. As the number of processors
increases, communications overhead becomes a significant part of the
overall processing time because values in the last
diagonal block must be broadcast to other processors before processing
can proceed to the next color. Even though the second ordering phase
is able to three-color the last diagonal blocks, communications overhead overwhelms the processing time for larger numbers of processors and limits speedup in this portion of the calculations. There are insufficient parallel operations when solving for the unknowns in the diagonal blocks of these matrices to offset the effect of the nearly sequential last block. The effect of Amdahl's law becomes visible for larger numbers of processors because this portion of the algorithm is essentially sequential.
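As a rough, communication-free illustration of this effect, the sketch below evaluates the Amdahl bound 1/(f + (1-f)/p) using the sequential fractions mentioned above: roughly 5% of the calculations in the last block for the BCSPWR matrices versus roughly 1% for EPRI-6K. The model is only indicative, since it ignores the broadcast costs discussed above.

```python
def amdahl_bound(f: float, p: int) -> float:
    """Upper bound on speedup when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / p)

for p in (2, 4, 8, 16, 32):
    # ~5% sequential last block (BCSPWR09/BCSPWR10) vs ~1% (EPRI-6K)
    print(f"p={p:2d}  bound(f=0.05)={amdahl_bound(0.05, p):5.2f}  "
          f"bound(f=0.01)={amdahl_bound(0.01, p):5.2f}")
```

Even with communication ignored, a 5% sequential portion caps speedup near 12.5 on 32 processors, while a 1% portion still permits a bound of roughly 24, which matches the qualitative gap between the BCSPWR and EPRI-6K results.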
The BCSPWR09 matrix suffers an additional problem: the ordering phase was unable to balance the workload effectively in the portion of the software that processes all but the last block. This matrix is the smallest examined, and it contains insufficient available parallelism to support 16 or more processors.
A detailed examination of relative speedup and relative efficiency for the EPRI-6K data is presented in figure 9. This figure contains two graphs, each with a family of four curves plotting relative speedup and relative efficiency for each of four maximum matrix partition sizes used in the node-tearing algorithm. The maximum partition sizes used when preprocessing these data are 128, 192, 256, and 320 nodes. The family of speedup curves clearly illustrates the effects of load imbalance for some matrix orderings: speedup is nearly equal for all four orderings on 2 through 16 processors, but the relative speedup values diverge for 32 processors.
Figure 9: Relative Speedup and Efficiency for EPRI-6K Data --- 2, 4, 8, 16, and 32 processors
We can look further into the cause of the disparity in the relative
speedup values in the EPRI-6K data by examining the performance of
each of the four distinct sections of the parallel algorithm.
Figure 10 contains four graphs that each have a family of four
curves that plot the processing time in milliseconds versus the number
of processors for each of four values of the maximum partition size used in the node-tearing algorithm. The partition sizes used when preprocessing these data are 128, 192, 256, and 320 nodes. These graphs are log-log scaled, so for perfect speedup the processing times should fall on a straight line of constant negative slope as the number of processors is repeatedly doubled. One or more curves on each of
the performance graphs for the diagonal blocks and lower border, for
updating the last diagonal block, and for convergence checks
illustrate nearly perfect speedup with as many as 32 processors.
Unfortunately, calculating the unknowns in the last block does not exhibit the same stellar parallel performance.
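To make the structure of this step concrete, the following is a minimal sketch of one sweep over the multi-colored last diagonal block, written with mpi4py and using an all-gather as a stand-in for the broadcasts of newly computed values described earlier; the function and variable names are hypothetical rather than taken from the implementation measured here.

```python
from mpi4py import MPI

def sweep_last_block(x, colors, my_rows_by_color, update_row, comm=MPI.COMM_WORLD):
    """One Gauss-Seidel sweep over the multi-colored last diagonal block.

    x                -- local copy of the solution values for the last block
    colors           -- color identifiers, processed in order
    my_rows_by_color -- dict: color -> rows of that color assigned to this processor
    update_row       -- callable(i, x) returning the new value of x[i]
    """
    for color in colors:
        # Rows of one color are mutually independent, so each processor can
        # update its assigned rows of this color without communication.
        local = [(i, update_row(i, x)) for i in my_rows_by_color.get(color, [])]
        # Every processor needs the new values before the next color begins;
        # this collective exchange is what dominates at larger processor counts.
        for chunk in comm.allgather(local):
            for i, xi in chunk:
                x[i] = xi
```

Because each color requires a separate collective exchange while the per-processor work in the small last block shrinks, the communication cost quickly dominates as processors are added, consistent with the poor scaling observed for this component.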
Figure 10: Timings for Algorithm Components --- EPRI-6K Data --- 2, 4, 8, 16, and 32 processors
The performance graph for the diagonal blocks and lower border clearly
illustrates the causes for the load imbalance observed in the relative
speedup graph in figure 9. For some matrix orderings, load
balancing is not able to divide the work evenly for larger numbers of
processors. This always occurs for larger values of the maximum partition size, the largest block size permitted when ordering the matrix. Nevertheless, when ordering a matrix for sixteen or more processors, selecting a small maximum partition size provides good speedup for larger numbers of processors. The performance curves presented in figure 8 show the best performance observed for the four matrix orderings.
Updating the last block by performing sparse matrix-vector products and then performing irregular communications yields good performance even for 32 processors. The time to perform the updates is correlated with the size of the last diagonal block, which is inversely related to the maximum partition size. This relationship is intuitive: as the maximum partition size increases, multiple smaller blocks can be incorporated into a single block. Not only can two smaller blocks be consolidated into a single block, but any elements in the coupling equations that are unique to those network partitions can also be moved into the larger block.
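As a sketch of this update step, assuming each processor stores its rows of the lower border as a SciPy CSR matrix, the local contribution to the last block is a single sparse matrix-vector product; the subsequent irregular communication that combines contributions from all processors is omitted, and the names are illustrative rather than taken from the measured code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def local_last_block_update(lower_border: csr_matrix,
                            x_block: np.ndarray,
                            rhs_last: np.ndarray) -> np.ndarray:
    """This processor's contribution to updating the last diagonal block.

    lower_border -- coupling rows (last-block rows by diagonal-block columns)
    x_block      -- solution values just computed for this diagonal block
    rhs_last     -- running right-hand side for the last block
    """
    # Sparse matrix-vector product: only the non-zero couplings contribute,
    # so the cost scales with the number of border entries, not the block size.
    return rhs_last - lower_border @ x_block
```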
The performance graph for convergence checking illustrates that the load balancing step does not assign equal numbers of rows to all processors; the number of rows on a processor varies with the load balancing. While the family of curves on this graph is more erratic than the curves for the diagonal blocks and lower border or for updating the last diagonal block, performance generally improves with near-perfect parallelism even for 32 processors.
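A minimal sketch of this kind of distributed convergence check, assuming each processor tests only the rows assigned to it by the load balancer and an all-reduce combines the per-processor maxima; the tolerance and names are illustrative.

```python
import numpy as np
from mpi4py import MPI

def converged(x_new_local: np.ndarray, x_old_local: np.ndarray,
              tol: float = 1e-6, comm=MPI.COMM_WORLD) -> bool:
    """Global convergence test over rows distributed by the load balancer."""
    # Largest change among the rows owned by this processor.
    local_max = float(np.abs(x_new_local - x_old_local).max()) if x_new_local.size else 0.0
    # Combine the per-processor maxima; every processor receives the global value.
    return comm.allreduce(local_max, op=MPI.MAX) < tol
```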
Information on the relative performance as a function of the maximum partition size and the number of processors would be required when implementing this parallel algorithm in a load-flow analysis application. To minimize the effects of data movement, the entire process of calculating the Jacobian when solving the systems of non-linear equations would have to respect the processor/data assignments made by the sparse linear solver. The time to solve each instance of the linear equations generated by the Jacobian is so small that all data redistribution must be eliminated; otherwise, the benefits of parallel processing speedup will be lost in the application.
Performance of this parallel Gauss-Seidel linear solver depends on the performance of the matrix preprocessing phase. We must reiterate that all available parallelism in this work results from ordering the matrix and identifying relationships in the connectivity pattern within its structure. Power systems load-flow matrices are among the sparsest irregular matrices encountered. For the EPRI-6K data, the mode of the histogram of edges per node is only two; in other words, the most frequent number of edges at a node is two. In the EPRI-6K data, 84.4% of the nodes have three or fewer edges, and in the BCSPWR10 matrix, 71% of the nodes have three or fewer edges. Consequently, power systems matrices pose some of the greatest challenges in producing efficient parallel sparse matrix algorithms.
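These degree statistics can be recomputed directly from the adjacency structure; the sketch below assumes the matrix is available as a symmetric SciPy CSR matrix and treats the off-diagonal non-zeros in each row as the edges at that node.

```python
import numpy as np
from scipy.sparse import csr_matrix

def degree_statistics(adjacency: csr_matrix):
    """Edges per node, the modal degree, and the fraction of nodes with <= 3 edges."""
    a = adjacency.tocsr().copy()
    a.setdiag(0)                      # drop diagonal entries (self-loops)
    a.eliminate_zeros()
    degrees = np.diff(a.indptr)       # off-diagonal non-zeros per row
    values, counts = np.unique(degrees, return_counts=True)
    mode = int(values[np.argmax(counts)])
    frac_le3 = float(np.mean(degrees <= 3))
    return degrees, mode, frac_le3
```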
In figures 11 and 12, we present two orderings of the EPRI-6K data with the maximum partition size equal to 128 and 256 nodes, respectively. Non-zero entries in the matrix are represented as dots, and the matrices are delimited by a bounding box. Each of these figures contains three sub-figures: the ordered sparse matrix and two enlargements of the last block, before and after multi-coloring.
Both matrices have been partitioned into block-diagonal-bordered form
and load-balanced for eight processors. The numbers of nodes in the
last diagonal blocks are 153 and 120 respectively, while the numbers
of edges in this part of the matrix are 34 and 22 respectively. The
graph multi-coloring algorithm is able to color these portions of the
matrices with three and two colors, respectively. These matrices represent the adjacency structure of the network graphs and clearly illustrate their sparsity.
Figure 11: Ordered EPRI-6K Matrix --- Maximum Partition Size of 128 Nodes
Figure 12: Ordered EPRI-6K Matrix --- Maximum Partition Size of 256 Nodes