We have also developed a parallel block-diagonal-bordered sparse Gauss-Seidel algorithm, optimized for the same very sparse, irregular matrices encountered in electrical power system applications. We developed this iterative method in order to compare the performance of parallel iterative algorithms with that of similar parallel direct algorithms. The block-diagonal-bordered matrix structure simplifies implementation and decomposes the problem into clearly identifiable sub-problems. The node-tearing ordering heuristic has proven successful in identifying the hierarchical structure in power system matrices and in reducing the number of coupling equations, so that the graph multi-coloring algorithm can often color the last block with only three colors. All available parallelism in our Gauss-Seidel algorithm derives from the actual interconnection relationships between elements in the matrix, as identified by the sparse matrix orderings. Consequently, the available parallelism is not unlimited. Like the direct methods we have developed in this research, the parallel block-diagonal-bordered Gauss-Seidel algorithm requires a three-step preprocessing phase that is reusable for static matrices: the matrix is ordered into block-diagonal-bordered form, pseudo-solved to obtain operation counts in the mutually independent diagonal blocks and the corresponding portions of the border, and load-balanced to distribute floating-point operations uniformly across all processors.
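To make the block structure concrete, the following minimal sequential sketch shows one Gauss-Seidel sweep over a matrix in block-diagonal-bordered form. The dense storage, the Block type, and the gs_sweep function are illustrative assumptions rather than the implementation described here, which uses sparse storage and distributes the diagonal blocks across processors; the sketch only exhibits why distinct diagonal blocks can be updated independently.

    /* One Gauss-Seidel sweep over a block-diagonal-bordered matrix,
     * stored densely for clarity.  Rows in diagonal block k have
     * nonzeros only in block k's columns and in the last (coupling)
     * block's columns, so each block update reads only its own
     * unknowns plus the last-block unknowns. */

    typedef struct {
        int start, end;   /* half-open range [start, end) of unknowns */
    } Block;

    void gs_sweep(int n, double a[n][n], const double b[], double x[],
                  const Block *blk, int nblocks, int last_start)
    {
        /* Mutually independent diagonal blocks: each pass of this loop
         * writes a disjoint set of unknowns, so in the parallel solver
         * the blocks are assigned to different processors. */
        for (int k = 0; k < nblocks; k++) {
            for (int i = blk[k].start; i < blk[k].end; i++) {
                double s = b[i];
                for (int j = blk[k].start; j < blk[k].end; j++)
                    if (j != i) s -= a[i][j] * x[j];
                for (int j = last_start; j < n; j++)  /* border coupling */
                    s -= a[i][j] * x[j];
                x[i] = s / a[i][i];
            }
        }
        /* Coupling equations in the last diagonal block; multi-coloring
         * exposes parallelism here, but a plain sweep suffices for the
         * sketch. */
        for (int i = last_start; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++)
                if (j != i) s -= a[i][j] * x[j];
            x[i] = s / a[i][i];
        }
    }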
We have extensively analyzed the performance of parallel solvers for power systems applications on the Thinking Machines CM-5. We have shown that the node-tearing-based partitioning algorithm yields matrices in block-diagonal-bordered form with balanced workloads for power systems networks with homogeneous voltage distribution lines, and that for such networks our parallel block-diagonal-bordered sparse Gauss-Seidel solver achieves good speedups. We have measured speedups in excess of 20 on 32 processors with this parallel sparse Gauss-Seidel algorithm.
The parallel Gauss-Seidel algorithm has proven quite sensitive to the interprocessor communications paradigm. The low-latency communications paradigm greatly improved the performance of the algorithm, both through reduced latency and through a significantly simplified implementation. It permits values to be sent to other processors asynchronously, as soon as they are calculated. This implementation greatly reduces parallel software overhead, since no buffers must be maintained, and it also greatly reduces the amount of data required to distribute the values calculated in the last diagonal block. With total control over individual messages in the last diagonal block, as few as 10% of the values that would otherwise be broadcast require distribution on 32 processors. As a result, we have measured good relative speedups with the low-latency implementation of parallel Gauss-Seidel, while the buffered communications version of the algorithm offers little or no speedup.
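A minimal sketch of the low-latency paradigm appears below, using MPI nonblocking point-to-point sends as a stand-in for the CM-5 message layer actually employed; the NeedList type and the distribute_value function are hypothetical names introduced for illustration. The point of the sketch is that each newly computed unknown is sent immediately, and only to the processors whose equations reference it, rather than being accumulated in a buffer for a broadcast.

    #include <mpi.h>

    #define MAX_CONSUMERS 32

    /* Hypothetical per-unknown list of the processors whose equations
     * reference that unknown, built during preprocessing. */
    typedef struct {
        int n_consumers;
        int consumer[MAX_CONSUMERS];
    } NeedList;

    /* Send x[i] point to point as soon as it is computed: no send
     * buffer is maintained, and only consumers of x[i] receive it,
     * which is why only a fraction of the values a broadcast would
     * move actually require distribution. */
    static void distribute_value(const double *x, int i,
                                 const NeedList *need, MPI_Request req[])
    {
        for (int c = 0; c < need[i].n_consumers; c++)
            MPI_Isend(&x[i], 1, MPI_DOUBLE, need[i].consumer[c],
                      /* tag = unknown index */ i,
                      MPI_COMM_WORLD, &req[c]);
        /* The caller must complete these requests (e.g. with
         * MPI_Waitall) before x[i] is overwritten in the next sweep. */
    }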