In this paper we present research into parallel block-diagonal-bordered sparse direct linear solver algorithms developed with special consideration for the irregular sparse matrices originating in the electrical power systems community. The available parallelism in the block-diagonal-bordered matrix structure promises simplified implementation and a straightforward decomposition of the problem into clearly identifiable subproblems. Parallel block-diagonal-bordered direct linear solvers require a three-step preprocessing phase that is reusable for static matrices: the matrix is ordered into block-diagonal-bordered form, pseudo-factored to identify the location of all fill-in and to obtain operation counts in the mutually independent diagonal blocks and corresponding portions of the borders, and then load-balanced to distribute operations uniformly across processors.
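As a rough illustration of this preprocessing pipeline, the following serial Python sketch permutes a matrix into block-diagonal-bordered form for a given node partition, performs a simple symbolic (pseudo-) factorization of each diagonal block to count operations including fill-in, and greedily assigns blocks to processors by operation count. The partition itself, the helper names, and the operation-count model are illustrative assumptions; they do not reproduce the node-tearing ordering or the pseudo-factorization used in our implementation.

    import numpy as np
    import scipy.sparse as sp

    def order_bdb(A, parts):
        # Permute A into block-diagonal-bordered form given a node partition.
        # parts[i] is a partition id, or -1 for border (coupling) nodes; how
        # the partition is chosen is the difficult step and is assumed here.
        border = [i for i, p in enumerate(parts) if p == -1]
        block_ids = sorted({p for p in parts if p != -1})
        blocks = [[i for i, p in enumerate(parts) if p == b] for b in block_ids]
        perm = [i for blk in blocks for i in blk] + border
        P = sp.csr_matrix(A)[perm, :][:, perm]
        return P, blocks, border

    def pseudo_factor_opcount(A_blk):
        # Symbolic ("pseudo") factorization of one diagonal block: propagate
        # the nonzero pattern to locate fill-in and count the update
        # operations, without computing any numerical values.
        pattern = (sp.csr_matrix(A_blk) != 0).toarray()
        n, ops = pattern.shape[0], 0
        for k in range(n):
            nz = [j for j in range(k + 1, n) if pattern[k, j] or pattern[j, k]]
            ops += len(nz) * (len(nz) + 1)   # approximate update work at step k
            for i in nz:
                for j in nz:
                    pattern[i, j] = True     # record fill-in
        return ops

    def load_balance(op_counts, n_procs):
        # Greedy assignment of diagonal blocks to processors by operation count.
        loads = np.zeros(n_procs)
        owner = {}
        for blk in sorted(range(len(op_counts)), key=lambda b: -op_counts[b]):
            p = int(np.argmin(loads))
            owner[blk] = p
            loads[p] += op_counts[blk]
        return owner

The greedy assignment mirrors the goal of the load-balancing step: a uniform per-processor workload in the mutually independent diagonal blocks and their corresponding border portions.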
We developed an implementation that offered speedups of nearly ten for double-precision LU factorization and even greater speedups for complex variate LU factorization on 32 processors. Speedups for parallel block-diagonal-bordered Choleski factorization were less than for LU factorization, and there are formidable problems implementing forward reduction due to the data distribution. We have also performed research into parallel block-diagonal-bordered sparse Gauss-Seidel algorithms, an iterative linear solution technique [24,25]. We are able to obtain substantially better speedups with the parallel Gauss-Seidel algorithm, although convergence of Gauss-Seidel is assured only for diagonally dominant and positive definite matrices. Likewise, Choleski factorization is limited to either symmetric diagonally dominant or symmetric positive definite matrices. Consequently, we have compared the performance of parallel sparse Choleski solvers and parallel sparse Gauss-Seidel algorithms.
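To make the structure of the block-diagonal-bordered computation concrete, the following serial sketch (illustrative only: dense blocks, no pivoting, and none of the interprocessor communication of the actual implementation) factors each mutually independent diagonal block A_k together with its border strips B_k and C_k, accumulates their updates into the last diagonal block, and then performs the forward reduction and backward substitution.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def bdb_lu_factor(blocks, rights, bottoms, last):
        # blocks[k] = A_k, rights[k] = B_k, bottoms[k] = C_k, last = A_nn.
        factors, W = [], []
        schur = np.array(last, dtype=float)
        for A_k, B_k, C_k in zip(blocks, rights, bottoms):
            lu_k = lu_factor(A_k)          # independent work: no communication
            W_k = lu_solve(lu_k, B_k)      # A_k^{-1} B_k
            schur -= C_k @ W_k             # contribution to the last block
            factors.append(lu_k)
            W.append(W_k)
        lu_last = lu_factor(schur)         # factor the updated last block
        return factors, W, lu_last

    def bdb_lu_solve(factors, W, lu_last, bottoms, b_blocks, b_last):
        # Forward reduction and backward substitution with the BDB factors.
        y = [lu_solve(lu_k, b_k) for lu_k, b_k in zip(factors, b_blocks)]
        rhs_last = b_last - sum(C_k @ y_k for C_k, y_k in zip(bottoms, y))
        x_last = lu_solve(lu_last, rhs_last)
        x_blocks = [y_k - W_k @ x_last for y_k, W_k in zip(y, W)]
        return x_blocks, x_last

The same block structure applies to the complex variate case with complex arrays, and to Choleski factorization when the matrix is symmetric: the per-block work is independent, and only the accumulation into the last block and the final triangular solution require interaction among processors.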
Power systems applications use sparse linear solvers in conjunction with either non-linear equation or differential-algebraic equation solvers. Applications often reuse a factored matrix numerous times, trading off the computational cost of repeated factorization against additional iterations in the non-linear equation solvers. A new LU factorization is not calculated every iteration; instead, an old LU decomposition is used to solve an approximate linear system, and a new factorization is calculated only every few iterations. The total cost of multiple linear solutions with dishonest reuse is therefore the cost of a single factorization plus the cost of a triangular solution for each (re)use.
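To make this trade-off explicit, let $t_F$ denote the time for one factorization and $t_S$ the time for one triangular solution (symbols introduced here only for illustration); the cost of a direct solution with $k$ dishonest (re)uses of one factorization is then

\[ C_{\mathrm{direct}}(k) \;=\; t_F + k\,t_S . \]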
We compare the performance of parallel Choleski solvers with parallel iterative Gauss-Seidel solvers by determining the number of Gauss-Seidel iterations that can be performed in the time required by the direct method for a given number of factorization (re)uses. Families of curves plotting the number of iterations versus the number of dishonest (re)uses are presented in figures 61 and 62 for one through ten reuses and one through 32 processors for the power systems networks BCSPWR09 and BCSPWR10. The shape of the curves shows that the largest number of iterations possible for a constant-time solution occurs for a single use of the factored matrix. As the factorization is (re)used, the cost to factor the matrix is amortized over the additional (re)uses.
For large numbers of factorization (re)uses, the curve becomes asymptotic to the ratio of the triangular-solution time to the time for a single Gauss-Seidel iteration.
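Under the same illustrative notation, with $t_{GS}$ the time for one Gauss-Seidel iteration, the curves plot the number of iterations per solution whose cost matches that of the dishonest direct approach,

\[ N(k) \;=\; \frac{t_F + k\,t_S}{k\,t_{GS}} \;=\; \frac{t_F}{k\,t_{GS}} + \frac{t_S}{t_{GS}} , \]

which decreases monotonically in $k$ and approaches the ratio $t_S/t_{GS}$ as the factorization cost is fully amortized.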
Figure 61: Gauss-Seidel Iterations as a Function of Dishonest Reuses of LU Matrix --- BCSPWR09
Figure 62: Gauss-Seidel Iterations as a Function of Dishonest Reuses of LU Matrix --- BCSPWR10
Figure 61 illustrates that 12 Gauss-Seidel iterations take as much time as a single factorization and triangular solution for the BCSPWR09 operations matrix on a single processor, while only four iterations per solution would equal the time for 10 dishonest (re)uses. However, when 32 processors are available, 54 Gauss-Seidel iterations could be performed in the same time as a single direct solution, and 24 iterations per solution for 10 dishonest (re)uses. Figure 62 illustrates similar performance for the BCSPWR10 power systems matrix: nearly 120 Gauss-Seidel iterations could be performed in the same time as a single direct solution for 32 processors, and 55 iterations per solution for 10 dishonest (re)uses. Given that good starting points are available for each successive iterative solution, parallel Gauss-Seidel should yield significant algorithmic speedups for diagonally dominant or positive definite sparse matrices.
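For reference, a minimal serial sketch of one sparse Gauss-Seidel sweep, the unit of work to which these iteration counts refer, follows; the parallel block-diagonal-bordered ordering and partitioning of this computation in [24,25] are not shown.

    import numpy as np
    import scipy.sparse as sp

    def gauss_seidel(A, b, x0, tol=1e-6, max_sweeps=100):
        # One "iteration" in the comparison above is one full sweep over the
        # rows.  Convergence is assured only for diagonally dominant or
        # symmetric positive definite matrices.
        A = sp.csr_matrix(A)
        x = x0.astype(float).copy()
        for sweep in range(max_sweeps):
            for i in range(A.shape[0]):
                start, end = A.indptr[i], A.indptr[i + 1]
                cols = A.indices[start:end]
                vals = A.data[start:end]
                diag = vals[cols == i][0]            # assumes a_ii is nonzero
                off = cols != i
                # x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii, using updated x_j
                x[i] = (b[i] - vals[off] @ x[cols[off]]) / diag
            if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
                return x, sweep + 1
        return x, max_sweeps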
The parallel block-diagonal-bordered direct solvers presented in this paper address the power systems applications that are most difficult to implement on a multi-processor: solutions involving only the power system network equations. Load-flow has the smallest matrices and the fewest calculations, owing to symmetry and the absence of any requirement for pivoting to ensure numerical stability. LU factorization of network equations for decoupled solutions of differential-algebraic equations requires additional calculations, but is often performed without numerical pivoting. We have shown in this paper that simply increasing the number of floating-point operations by a factor of six significantly improves the parallel speedup of the algorithm.
Parallel block-diagonal-bordered sparse linear solver algorithms can readily be extended to applications in which the power systems network forms a small portion of a larger matrix, for example, the entire system of linearized differential-algebraic equations encountered in transient stability analysis or small-signals analysis applications. These applications add many natural blocks of linearized differential equations that significantly increase the size of the matrix. The linearized differential equations are less sparse than the network equations and may require pivoting to ensure numerical stability. Pivoting for this matrix would be limited to within the diagonal blocks to limit fill-in, but the efficient static data structures would need to be replaced by less-efficient dynamic linked-list-based data structures. Any of these modifications would increase the computational workload, but this is work that does not require interprocessor interactions. As a result, modifying the algorithms to include these additional features would improve parallel speedup, both for the Thinking Machines CM-5 and for future machines that will be significantly faster than the MPPs and SPPs of today.