Research is in progress to examine efficient parallel algorithms for the direct solution of sparse systems of equations, Ax = b, when the sparse block-diagonal-bordered matrix A is factored into lower and upper triangular matrices, A = LU. Forward reduction and backward substitution steps are then required to solve for the vector x in the initial system of linear equations. The parallel direct solution of sparse matrices in block-diagonal-bordered form is performed in three distinct steps: LU factorization, forward reduction, and backward substitution, as illustrated in figure 1.
Figure 1: Block-Diagonal-Bordered Form Sparse Matrix Solution Steps
LU factorization of block-diagonal-bordered sparse matrices has significant advantages over general sparse matrix solvers. First, for all but the last block, every processor is assigned all the data required for its calculations. Second, all calculations in the independent blocks and borders can be performed in parallel, without communications to fetch data for updates: the LU factorization of the diagonal blocks proceeds without interprocessor communications, and the numerical updates of the last block that use data in the borders can also be performed in parallel. For this step, communications are required only to send partial sums of updates to the processors that hold data values in the last block of the matrix. Interprocessor communications are thus significantly reduced, and are also uniform and structured. Finally, the last block is factored using whatever method is most efficient for the density of this sub-matrix. The factorization steps are illustrated in part (a) of figure 1.
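A minimal sequential sketch of this factorization, written in Python/NumPy, is given below. It assumes dense blocks, the bordered block structure described above, and that no pivoting is required; the names (lu_nopivot, factor_bbd) and the argument layout are illustrative, not notation from this paper.

    import numpy as np

    def lu_nopivot(A):
        """Dense LU factorization without pivoting (assumes it exists)."""
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            L[k + 1:, k] = A[k + 1:, k] / A[k, k]
            A[k + 1:, k:] -= np.outer(L[k + 1:, k], A[k, k:])
        return L, np.triu(A)

    def factor_bbd(blocks, right_border, lower_border, last):
        """Factor a block-diagonal-bordered matrix A = LU.

        blocks[i]       -- independent diagonal block A_i
        right_border[i] -- border columns B_i coupling block i to the last block
        lower_border[i] -- border rows C_i coupling the last block to block i
        last            -- the last diagonal block
        """
        factors = []
        schur = last.astype(float).copy()
        for A_i, B_i, C_i in zip(blocks, right_border, lower_border):
            # Independent per-block work: runs on one processor, no communication.
            L_i, U_i = lu_nopivot(A_i)
            U_b = np.linalg.solve(L_i, B_i)        # border columns of U
            L_b = np.linalg.solve(U_i.T, C_i.T).T  # border rows of L
            # The only interprocessor traffic: partial-sum updates of the last block.
            schur -= L_b @ U_b
            factors.append((L_i, U_i, L_b, U_b))
        last_factors = lu_nopivot(schur)           # factor the last block
        return factors, last_factors

The Schur-complement subtraction is the only place where data from different blocks meet, which is why those partial sums are the only interprocessor communication required during factorization.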
After the numerical factorization of the block-diagonal-bordered matrix, most of the calculations in the remaining forward reduction and backward substitution phases can also be performed in parallel without requiring communications. The reduction/substitution steps are illustrated in part (b) of figure 1. The forward reduction stage proceeds in parallel until the last block is reduced; communications are required only to accumulate partial sums of products as the border equations are reduced. Likewise, most of the calculations in the backward substitution phase can be performed in a highly parallel manner once substitution for the last block is complete and its results have been broadcast to all processors.
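Continuing the sketch above under the same assumptions, the triangular solves follow the same communication pattern: partial sums flow into the last block's right-hand side during forward reduction, and the solution for the last block is broadcast before the independent back-substitutions complete in parallel.

    def solve_bbd(factors, last_factors, b_blocks, b_last):
        """Forward reduction and backward substitution with the BBD factors."""
        L_last, U_last = last_factors
        # Forward reduction: independent blocks in parallel; the border
        # products are partial sums accumulated into the last block's RHS.
        ys, rhs_last = [], b_last.astype(float).copy()
        for (L_i, _, L_b, _), b_i in zip(factors, b_blocks):
            y_i = np.linalg.solve(L_i, b_i)
            rhs_last -= L_b @ y_i
            ys.append(y_i)
        y_last = np.linalg.solve(L_last, rhs_last)
        # Backward substitution: solve the last block first, broadcast
        # x_last, then every independent block completes in parallel.
        x_last = np.linalg.solve(U_last, y_last)
        xs = [np.linalg.solve(U_i, y_i - U_b @ x_last)
              for (_, U_i, _, U_b), y_i in zip(factors, ys)]
        return xs, x_last

Assembling the full matrix from the blocks and comparing the computed solution against a dense numpy.linalg.solve is a convenient sanity check for the sketch.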
Block-diagonal-bordered matrix forms offer significant advantages when solving sparse linear systems of equations. Unlike the sparse linear solvers described in [7,8,9,12,23,24,25,26,27,31], the task assignments for parallel numerical factorization of a block-diagonal-bordered matrix depend only on the assignment of independent blocks to processors and on the processor assignments of data in the last diagonal block. However, additional computational work must be performed in the initial preprocessing stages to determine which independent blocks are assigned to which processors.
Block-diagonal-bordered sparse matrix algorithms require modifications to the normal preprocessing phase. Each of the numerous papers referenced above uses the same paradigm: order the sparse matrix, then perform symbolic factorization to determine the locations of all fill-in values so that static data structures can be utilized for maximum efficiency during numerical factorization. We propose modifying this commonly used sparse matrix preprocessing phase to include an explicit load balancing step coupled to the ordering step, so that the workload is uniformly distributed throughout a distributed-memory multi-processor and the parallel algorithms make efficient use of the computational resources.
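For reference, a minimal symbolic LU factorization over a nonzero pattern might look like the sketch below. The set-based representation and the name symbolic_lu are assumptions for illustration; a production implementation would use elimination trees and compressed data structures.

    def symbolic_lu(pattern, n):
        """Return the nonzero pattern of L+U, including all fill-in.

        pattern -- set of (row, col) positions that are nonzero in A
        n       -- matrix dimension

        Eliminating row/column k creates fill-in at (i, j) whenever the
        pattern (original or earlier fill) contains (i, k) and (k, j)
        with i, j > k.
        """
        filled = set(pattern) | {(i, i) for i in range(n)}
        for k in range(n):
            rows = [i for i in range(k + 1, n) if (i, k) in filled]
            cols = [j for j in range(k + 1, n) if (k, j) in filled]
            filled.update((i, j) for i in rows for j in cols)
        return filled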
Transforming a sparse matrix into block-diagonal-bordered form requires relatively sophisticated ordering techniques, and also requires that load balancing be performed based on the number of calculations in each independent block. Because the number of rows or columns in an independent block correlates poorly with its workload, the actual number of calculations in each independent block must be determined in a pseudo factorization step during the preprocessing phase. Pseudo factorization also determines the location of all fill-in, so that efficient static data structures can be utilized when pivoting for numerical stability is not required; a sketch of both steps follows. This three-step preprocessing phase will be more computationally intensive than the simple algorithms used in the recent literature to prepare a matrix for numerical factorization. The additional processing requirements limit the applicability of this technique to repetitive solutions of static network structures, where the extra effort can be amortized over multiple solutions of similar linear systems.
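A plausible sketch of the pseudo factorization and load balancing steps is given below: it repeats the symbolic elimination above while counting multiply-add updates as a workload estimate for each independent block, then assigns blocks to processors with a greedy longest-processing-time heuristic. Both the operation-count model and the greedy heuristic are assumptions for illustration, not the algorithm proposed here.

    import heapq

    def pseudo_factor_ops(pattern, n):
        """Count the multiply-add updates symbolic LU would perform.

        Runs the same elimination as symbolic_lu() but also tallies the
        rank-1 update work, estimating the block's factorization cost.
        """
        filled = set(pattern) | {(i, i) for i in range(n)}
        ops = 0
        for k in range(n):
            rows = [i for i in range(k + 1, n) if (i, k) in filled]
            cols = [j for j in range(k + 1, n) if (k, j) in filled]
            ops += len(rows) * len(cols)
            filled.update((i, j) for i in rows for j in cols)
        return ops

    def balance_blocks(block_ops, n_procs):
        """Greedily assign blocks to processors, largest workload first."""
        heap = [(0, p, []) for p in range(n_procs)]  # (load, processor, blocks)
        heapq.heapify(heap)
        for blk in sorted(range(len(block_ops)), key=lambda b: -block_ops[b]):
            load, p, assigned = heapq.heappop(heap)  # least-loaded processor
            assigned.append(blk)
            heapq.heappush(heap, (load + block_ops[blk], p, assigned))
        return {p: assigned for _, p, assigned in heap}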