LU factorization of block-diagonal-bordered sparse matrices has significant advantages over general sparse matrix solvers. For all but the last block, every processor holds all of the data required for its calculations, so all calculations for the independent blocks and borders can be performed in parallel. In particular, LU factorization of the independent blocks is highly parallel and requires no interprocessor communications; communications are required only to send the partial updates to the processors that hold data values in the last block of the matrix. Finally, the last block is factored in the manner most efficient for the density of that sub-matrix. Research on ordering matrices drawn from real power system networks has shown that it is possible to reduce the number of coupling equations and, consequently, the size of the last block to the point where this sub-matrix becomes nearly dense after fill-in. Efficient dense matrix solvers for this case have been described in the literature [4,29,30].
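The factorization flow above can be sketched in serial Python. This is a minimal illustration only: it assumes dense list-of-lists storage, a Doolittle kernel without pivoting, and hypothetical function names; it is not the paper's implementation. Each iteration of the loop in `bbd_factor` corresponds to the work one processor performs independently, and only the `Vk @ Wk` partial updates to the last block would need to be communicated.

```python
def lu_factor(A):
    """In-place Doolittle LU without pivoting: A becomes L\\U, unit-diagonal L."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A

def lower_solve_cols(LU, B):
    """Solve L * W = B column by column (L is the unit-lower part of LU)."""
    n, m = len(LU), len(B[0])
    W = [row[:] for row in B]
    for j in range(m):
        for i in range(n):
            for k in range(i):
                W[i][j] -= LU[i][k] * W[k][j]
    return W

def upper_solve_rows(LU, C):
    """Solve V * U = C row by row (U is the upper part of LU)."""
    p, n = len(C), len(LU)
    V = [row[:] for row in C]
    for i in range(p):
        for j in range(n):
            for k in range(j):
                V[i][j] -= V[i][k] * LU[k][j]
            V[i][j] /= LU[j][j]
    return V

def bbd_factor(blocks, right_borders, bottom_borders, last):
    """Factor each independent block, form its border strips, and accumulate
    the Schur-complement update to the last block before factoring it."""
    strips = []
    D = [row[:] for row in last]
    for Ak, Bk, Ck in zip(blocks, right_borders, bottom_borders):
        lu_factor(Ak)                  # local to one processor, no communication
        Wk = lower_solve_cols(Ak, Bk)  # border strip of U
        Vk = upper_solve_rows(Ak, Ck)  # border strip of L
        for i in range(len(D)):        # partial update D -= Vk * Wk
            for j in range(len(D)):    # (the only data sent between processors)
                D[i][j] -= sum(Vk[i][t] * Wk[t][j] for t in range(len(Ak)))
        strips.append((Wk, Vk))
    lu_factor(D)                       # last block, nearly dense after fill-in
    return blocks, strips, D
```

The last block is factored only after every processor's partial update has been subtracted, which is why it serializes the end of the factorization.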
After the numerical factorization of the block-diagonal-bordered matrix, most of the calculations in the remaining forward reduction and backward substitution phases can also be performed in a highly parallel manner. Forward reduction proceeds in parallel until the last block is reduced; communications are required only to accumulate partial sums of products from the solved variables as the border equations are reduced. Likewise, most of the backward substitution phase can be performed in parallel with no additional communications once substitution has been completed for the last block and its results have been broadcast to all processors.
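The triangular-solve phases above can be sketched as follows, assuming the factors are stored in a combined L\U layout (unit-diagonal L implied) together with the border strips Wk and Vk of U and L. The helper names and data layout are illustrative assumptions; the comments mark where communication would occur in a parallel implementation.

```python
def fwd(LU, b):
    """Forward reduction with the unit-lower factor stored in LU."""
    y = b[:]
    for i in range(len(y)):
        for k in range(i):
            y[i] -= LU[i][k] * y[k]
    return y

def bwd(LU, y):
    """Backward substitution with the upper factor stored in LU."""
    n = len(y)
    x = y[:]
    for i in range(n - 1, -1, -1):
        for k in range(i + 1, n):
            x[i] -= LU[i][k] * x[k]
        x[i] /= LU[i][i]
    return x

def bbd_solve(block_lus, strips, last_lu, b_blocks, b_last):
    # Forward reduction of the independent blocks: fully parallel.
    ys = [fwd(LU, bk) for LU, bk in zip(block_lus, b_blocks)]
    # Border reduction: each processor contributes the partial sum Vk * yk
    # (the only communication in this phase).
    y_last = b_last[:]
    for (Wk, Vk), yk in zip(strips, ys):
        for i in range(len(y_last)):
            y_last[i] -= sum(Vk[i][t] * yk[t] for t in range(len(yk)))
    y_last = fwd(last_lu, y_last)
    # Solve the last block, then broadcast x_last to all processors.
    x_last = bwd(last_lu, y_last)
    xs = []
    for LU, (Wk, Vk), yk in zip(block_lus, strips, ys):
        # Subtract the border contribution, then back-substitute locally.
        rhs = [yk[i] - sum(Wk[i][j] * x_last[j] for j in range(len(x_last)))
               for i in range(len(yk))]
        xs.append(bwd(LU, rhs))
    return xs, x_last
```

After the broadcast of `x_last`, each processor completes its back-substitution independently, mirroring the communication-free structure of the block factorization.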
This section of the paper describes the benefits of LU factorization of block-diagonal-bordered sparse matrices and presents variants of Crout's algorithm that are applicable to parallel block-diagonal-bordered LU factorization. Which variant is appropriate depends on implementation details dictated by the nature of the sparse matrix and the multi-processor architecture.