We next present a detailed analysis of the performance of the
component parts of the parallel block-diagonal-bordered Gauss-Seidel
algorithm. We present graphs that show the time in milliseconds to
perform each of the component operations of the algorithm (a simplified
sketch of one iteration follows the list):
- calculate x in the diagonal blocks,
- update the values for the last diagonal block using the lower border,
- calculate x in the last diagonal block using those updated values,
- perform a convergence check.
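To make the four component operations concrete, the following is a minimal serial sketch of one iteration, assuming a compressed-row storage scheme and hypothetical structure and field names that are not taken from the thesis; the actual parallel implementation distributes the diagonal blocks and border rows across processors and exchanges updated values with active messages.

```c
#include <math.h>

/* Hypothetical compressed-row layout: diagonal block k owns rows
 * [row_start[k], row_start[k+1]); rows from last_start to n-1 hold the
 * lower border together with the last diagonal block. */
typedef struct {
    int     n;          /* total number of unknowns                   */
    int     nblocks;    /* number of independent diagonal blocks      */
    int     last_start; /* first row of the last diagonal block       */
    int    *row_start;  /* row ranges of the diagonal blocks          */
    int    *row_ptr;    /* CRS row pointers                           */
    int    *col_idx;    /* CRS column indices                         */
    double *val;        /* CRS nonzero values                         */
} BorderedMatrix;

/* One Gauss-Seidel iteration over a block-diagonal-bordered matrix.
 * Returns the maximum componentwise change, used by the caller for the
 * convergence check. */
double gs_iteration(const BorderedMatrix *A, const double *b, double *x)
{
    double max_dx = 0.0;

    /* Phase 1: relax x within each independent diagonal block.
     * The blocks share no unknowns, so they can be assigned to
     * different processors and relaxed concurrently. */
    for (int k = 0; k < A->nblocks; ++k)
        for (int i = A->row_start[k]; i < A->row_start[k + 1]; ++i) {
            double sum = b[i], diag = 1.0;
            for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
                int j = A->col_idx[p];
                if (j == i) diag = A->val[p];
                else        sum -= A->val[p] * x[j];
            }
            double x_new = sum / diag;
            if (fabs(x_new - x[i]) > max_dx) max_dx = fabs(x_new - x[i]);
            x[i] = x_new;
        }

    /* Phases 2 and 3: the rows of the last diagonal block are updated
     * using the lower-border couplings to the x values just computed,
     * then relaxed; the multi-colored ordering of the last block is
     * folded into this single serial loop for brevity. */
    for (int i = A->last_start; i < A->n; ++i) {
        double sum = b[i], diag = 1.0;
        for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
            int j = A->col_idx[p];
            if (j == i) diag = A->val[p];
            else        sum -= A->val[p] * x[j];
        }
        double x_new = sum / diag;
        if (fabs(x_new - x[i]) > max_dx) max_dx = fabs(x_new - x[i]);
        x[i] = x_new;
    }

    /* Phase 4: the caller compares max_dx against a tolerance. */
    return max_dx;
}
```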
Detailed parallel algorithm analysis will demonstrate that
the preprocessing phase can effectively load balance the matrix for
as many as 16 processors for all networks examined, and for as many as
32 processors for certain classes of data sets. We present graphs that illustrate algorithm
component performance in figure 7.16. Each graph has four
curves that show parallel Gauss-Seidel component performance for a
single iteration.
Figure 7.16: Algorithm Component Timing Data --- Double Precision Gauss-Seidel
These figures corroborate the results of the previous two sections
that identified load imbalance for the two planning networks ---
EPRI6K and NiMo-PLANS. The graphs with performance data from the
BCSPWR09, BCSPWR10, and NiMo-OPS power systems matrices show good load
balancing for the diagonal blocks and lower border; however, the
graphs for the EPRI6K and NiMo-PLANS data show degraded performance
for 32 processors. Load imbalance is evident when the empirical
performance data for calculating x in the diagonal blocks do not yield
a straight line. Load imbalance is also the likely reason that the
curves for updating the last diagonal block from the lower border and
for performing convergence checks do not have constant slope. Previous
graphs showed little effect from increasing the
computation-to-communications granularity, so any degraded performance
is due to sources of overhead other than communications.
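One way to quantify the load imbalance suggested by these curves is to compare the largest per-processor workload in the diagonal-block phase with the mean workload, since the elapsed time of a synchronized phase tracks the maximum rather than the average. The sketch below is illustrative only; the metric, function name, and work counts are assumptions, not taken from the thesis.

```c
/* Illustrative load-imbalance factor for one algorithm phase:
 * ratio of the most heavily loaded processor's work (e.g. nonzero
 * operations assigned to it) to the average work per processor.
 * 1.0 is perfect balance; a factor of 1.5 limits the achievable
 * speedup on p processors to roughly p / 1.5. */
double load_imbalance(const long *work_per_proc, int nprocs)
{
    long max_work = 0, total = 0;
    for (int p = 0; p < nprocs; ++p) {
        total += work_per_proc[p];
        if (work_per_proc[p] > max_work) max_work = work_per_proc[p];
    }
    return (double)max_work * (double)nprocs / (double)total;
}
```

A factor near 1.0 corresponds to the straight-line timing curves observed for the operations matrices, while the degraded 32-processor results for the planning matrices correspond to a factor well above 1.0.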
The times to calculate x in the multi-colored last diagonal block are
always the least of the four operations for all five power systems
networks, and for all but the planning matrices, the time to solve for
x in this block is monotonically decreasing. Communications overhead,
if present, would appear in this algorithm component as the number of
processors increases.
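The multi-coloring of the last diagonal block is what makes this component parallel: unknowns of the same color do not couple to one another, so each color can be relaxed concurrently, and communication is needed only between colors. A hedged sketch of that structure follows, reusing the BorderedMatrix layout from the earlier sketch; the color array and the exchange_updates() callback are illustrative placeholders for the active-message exchange, not the thesis's actual interface.

```c
/* Illustrative multi-colored relaxation of the last diagonal block.
 * color[i] is the color of unknown i; same-colored unknowns are
 * independent and may be updated concurrently. exchange_updates()
 * stands in for the active-message broadcast of newly computed x
 * values -- the only point where communication overhead can enter
 * this algorithm component. */
void relax_last_block(const BorderedMatrix *A, const double *b, double *x,
                      const int *color, int ncolors,
                      void (*exchange_updates)(double *x, int c))
{
    for (int c = 0; c < ncolors; ++c) {
        for (int i = A->last_start; i < A->n; ++i) {
            if (color[i] != c) continue;
            double sum = b[i], diag = 1.0;
            for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; ++p) {
                int j = A->col_idx[p];
                if (j == i) diag = A->val[p];
                else        sum -= A->val[p] * x[j];
            }
            x[i] = sum / diag;
        }
        exchange_updates(x, c);  /* synchronize after each color */
    }
}
```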
We draw the following conclusions from this detailed examination of
the parallel Gauss-Seidel algorithm components:
- The low-latency, active-message-based implementations are
able to obtain good performance improvements for all algorithm
components as the number of processors increases, even when
solving for x in the small operations matrices.
- Power systems networks can vary greatly: planning networks
may be larger than operations networks, and these matrices have
different characteristics. Planning matrices are likely to have
the poorest performance for 32 processors due to load imbalance.
For operations matrices, performance times for every component are
monotonically decreasing, illustrating good load balance. For
planning matrices, performance at 16 and 32 processors shows the
limitations of our preprocessing phase in ordering these matrices for
the parallel Gauss-Seidel algorithm. Nevertheless, this parallel
block-diagonal-bordered algorithm can achieve speedups of over 20 on
32 processors for large power systems networks with homogeneous
voltage lines throughout the matrix. Operations matrices
demonstrate performance that is nearly as good.