The general nature of block-diagonal-bordered LU factorization algorithms described in this paper follows Crout's algorithm presented in section 2.3. Crout's algorithm is a fan-in algorithm that updates a row/column from previous values. During the preprocessing phase, the matrix is ordered into block-diagonal-bordered form, the location of all fillin is determined to permit static data structures, and load balancing is performed on the independent blocks in order to distribute the workload uniformly to all processors when factoring the independent blocks. The data is distributed to processors in such a manner that the data for an independent block and the corresponding portions of the borders are assigned to the same processor. This is illustrated in figure 1(a).
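As a rough sketch of this data distribution, the fragment below assigns each independent block, together with the corresponding strips of the upper and lower borders, to a single processor. The partition description, the slice boundaries, and the round-robin assignment are illustrative assumptions standing in for the ordering and load-balancing steps of the preprocessing phase; they are not the paper's preprocessing algorithm itself.

```python
# Hypothetical description of a matrix already ordered into
# block-diagonal-bordered form: three independent blocks plus a last block.
partition = {"blocks": [slice(0, 40), slice(40, 75), slice(75, 120)],
             "last":   slice(120, 150)}

def distribute(partition, n_procs):
    """Assign each independent diagonal block and its border strips to one
    processor, so no interprocessor communication is needed while the
    independent blocks are factored.  Round-robin assignment stands in for
    the load balancing performed during preprocessing."""
    assignment = {p: [] for p in range(n_procs)}
    last = partition["last"]
    for k, blk in enumerate(partition["blocks"]):
        assignment[k % n_procs].append({
            "diagonal_block": (blk, blk),   # A_kk
            "upper_border":   (blk, last),  # rows of block k, last-block columns
            "lower_border":   (last, blk),  # last-block rows, columns of block k
        })
    return assignment

print(distribute(partition, n_procs=2))
```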
The first step of the parallel algorithm, LU factorization of the
independent blocks, is straightforward. Precedence permits the
parallel factorization of each independent block and its associated
section of border. Figure 7 depicts the precedence
in the calculations for a block in an idealized
block-diagonal-bordered sparse matrix. The data in each block and
border portion may also be sparse. Figure 8
illustrates the element level dependencies for non-zero elements in
both the diagonal block and the associated portion of the border.
Note that the normal fan-in precedence relationships illustrated in
figure 7 suggest, at first glance, that data may be
required from previous border sections possibly stored on other
processors; however, due to the block-diagonal-bordered form of the
matrix, any value that would be used for an update in a
previous block is equal to zero by definition. Consequently, no
interprocessor communication is required in this parallel LU
factorization step, because all data is assigned to the processor that
uses it to perform updates.
Figure 7: Precedence Relationship for Independent Blocks
Figure 8: Element Level Dependencies for Independent Matrix Blocks
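A minimal sketch of this first step is given below for one independent block, assuming dense NumPy arrays and no pivoting for clarity; the paper's implementation works on the sparse static data structures built during preprocessing. The block `A`, its upper-border strip `B`, and its lower-border strip `C` are all local to one processor, so the Crout-style fan-in updates below touch only local data.

```python
import numpy as np

def factor_independent_block(A, B, C):
    """Crout-style LU factorization of one independent diagonal block A
    together with its strip B of the upper border and strip C of the lower
    border.  Returns (L, U, LB, UB) with A = L @ U, B = L @ UB, C = LB @ U;
    U is unit upper triangular."""
    n, m = A.shape[0], B.shape[1]          # block size, last-block size
    L = np.zeros((n, n)); U = np.eye(n)
    LB = np.zeros((m, n)); UB = np.zeros((n, m))
    for j in range(n):
        # Fan-in: column j of L is updated from previously computed columns.
        for i in range(j, n):
            L[i, j] = A[i, j] - L[i, :j] @ U[:j, j]
        # The lower-border strip behaves like extra rows of L.
        for r in range(m):
            LB[r, j] = C[r, j] - LB[r, :j] @ U[:j, j]
        # Row j of U.
        for i in range(j + 1, n):
            U[j, i] = (A[j, i] - L[j, :j] @ U[:j, i]) / L[j, j]
        # The upper-border strip behaves like extra columns of U.
        for c in range(m):
            UB[j, c] = (B[j, c] - L[j, :j] @ UB[:j, c]) / L[j, j]
    return L, U, LB, UB
```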
The second step in the numerical factorization of a block-diagonal-bordered matrix is to update the last block using data in the borders. For a fan-in algorithm, figure 9 illustrates the required calculations for the factorization of the first column and row of the last block. However, precedence permits the updates from the borders to occur in any order as long as they occur before the column and row updates are completed. For parallel algorithms, there are definite advantages in calculating all updates from the borders at the same time. For systems where the communications latency when generating a message dominates the communications costs, a limited number of long messages has significant advantages over numerous short messages. Performing all updates at the same time in order to generate long messages can save significant time that would otherwise be part of the communication overhead in the parallel algorithm. When implemented on processors with low-latency communications support for short messages, it may be advantageous to perform these calculations in a manner that more closely resembles a general fan-in algorithm. It may even be advantageous to calculate the partial sums and store the values until the processor holding a data value that is part of the last block initiates the retrieval of a value to complete the factorization of a column or row.
Figure 9: Fan-in Data Pattern for the Factorization of the Last Matrix Block
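The following sketch illustrates the message-aggregation strategy described above, under the assumption that a processor has already factored its locally assigned independent blocks and holds the resulting border strips `LB_k` and `UB_k` (as returned by a routine like the one sketched for step one). All of the processor's contributions to the last block are accumulated in a single dense partial-sum matrix, so that one long message can be sent instead of many short ones.

```python
import numpy as np

def local_border_contribution(border_strips, last_block_size):
    """Accumulate this processor's updates to the last block in one dense
    partial-sum matrix.  'border_strips' is assumed to be a list of
    (LB_k, UB_k) pairs, one per locally factored independent block."""
    partial = np.zeros((last_block_size, last_block_size))
    for LB_k, UB_k in border_strips:
        partial += LB_k @ UB_k          # all updates from block k at once
    return partial                      # shipped as a single long message
```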
Figure 10 illustrates the element level data
dependencies for updates to elements in the last block by data in the
borders. To update value $a_{ij}$ in the last block, the value $a_{kj}$
in the upper border and the value $a_{ik}$ in the lower border must be
multiplied together. All other values in the same column as $a_{kj}$
(within the independent block) and the same row as $a_{ik}$ (within the
independent block) can be multiplied together and summed to minimize
communications further, yielding the partial sum $\sum_k a_{ik} a_{kj}$.
Partial sums are calculated on each processor for all of its independent
blocks and corresponding borders, so the number of partial sums becomes a
function of the number of processors rather than the number of
independent blocks. The partial sums can be
further reduced in parallel if the partial sums are generally
non-zero. Parallel reduction of partial sums is illustrated in
figure 11. If the partial sums are sparse relative
to the various processors, then the partial sums should be sent
directly to the corresponding processor to minimize communications.
Eventually the partial sums are subtracted from the elements in the
last block. Similar calculations are required for the upper
triangular portion of the last block and the borders.
Figure 10: Element Level Dependencies for Independent Matrix Borders
Figure 11: Parallel Reduction of Partial Sums
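A sketch of the parallel reduction of figure 11 and of the final subtraction from the last block is shown below. The binary-tree schedule operates here on an in-memory list that stands in for the processors; on a real machine each addition at a given level pairs processors a fixed stride apart and costs one message.

```python
import numpy as np

def reduce_partial_sums(partials):
    """Binary-tree reduction of the per-processor partial-sum matrices."""
    partials = list(partials)
    stride = 1
    while stride < len(partials):
        for p in range(0, len(partials) - stride, 2 * stride):
            partials[p] = partials[p] + partials[p + stride]  # one "message"
        stride *= 2
    return partials[0]

def apply_border_updates(A_last, partials):
    """Subtract the reduced partial sums (the accumulated a_ik * a_kj terms)
    from the last block before it is factored."""
    return A_last - reduce_partial_sums(partials)
```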
The third step in the numerical factorization of a block-diagonal-bordered matrix is the factorization of the last block in the matrix. Due to the fillin in the matrix, it is common for the lower right portion of the last block to be nearly dense. Implicit sparse matrix storage avoids explicitly storing zero values, but it incurs the overhead of storing row and column location indicators and pointers to the next value in a row or column. Consequently, when a portion of a matrix becomes dense due to fillin, that portion of the matrix should be stored and factored in dense form. The most desirable situation would be a small last block that is sufficiently dense to permit explicit storage, so that this sub-matrix can be updated efficiently as a dense matrix. In this situation, the best parallel dense matrix factorization algorithm for the particular target architecture could be used to factor the last block. Otherwise, the algorithm would be more complicated and would include partitioning the last block sub-matrix into a dense block in the lower right-hand corner with the remainder of the sub-matrix being sparse. The size of the last block depends on the structure of the data and the ordering algorithm used in the preprocessing phase.
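The routine below sketches this decision under simple assumptions: if the last block (after all border updates) is dense enough, it is stored explicitly and handed to a dense LAPACK-based factorization (SciPy's `lu_factor` stands in here for the best dense algorithm on the target architecture); the 0.5 density threshold is purely illustrative, and the sparse/dense partitioning mentioned above is not shown.

```python
import numpy as np
from scipy.linalg import lu_factor

def factor_last_block(A_last, density_threshold=0.5):
    """Factor the last block after the border updates have been applied."""
    density = np.count_nonzero(A_last) / A_last.size
    if density >= density_threshold:
        # Dense enough: store explicitly and use a dense factorization
        # (partial pivoting, LAPACK getrf underneath).
        return lu_factor(A_last)
    # Otherwise the more complicated path applies: partition the block into
    # a dense lower-right corner plus a sparse remainder (not sketched here).
    raise NotImplementedError("sparse/dense partitioning not sketched")
```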
For sparse matrices that have nearly dense last blocks, one method of enhancing LU factorization algorithms on either sequential or parallel architectures is to control the size of the last block. It is better to have a small, very dense last block rather than a larger block that is less dense, but still sufficiently dense as to require explicit storage of the matrix for efficient factorization. It can be shown that the number of multiplication and subtraction operations to factor either the lower triangular or the upper triangular portion of the last block, assuming that the data will be stored explicitly as a dense matrix, is approximately
\[
\frac{n^3}{6},
\]
where $n$ is the size of the last block. Thus, the number of
calculations is $O(n^3)$, and even small increases in the size
of the last block cause large increases in the number of calculations.
A 25% increase in the size of the last block doubles the number of
calculations ($1.25^3 \approx 1.95$), and a 60% increase in the number
of equations in the last block increases the number of calculations by
a factor of four ($1.6^3 \approx 4.1$).
Consequently, controlling the size of the last block during the
ordering step of the preprocessing phase can have significant
ramifications on the run time of block-diagonal-bordered sparse matrix
factorization algorithms.
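These ratios follow directly from the cubic operation count; the short check below uses the $n^3/6$ leading term from the expression above, with the block sizes chosen only for illustration.

```python
def triangular_ops(n):
    """Leading-order count of multiply/subtract pairs for one triangular
    portion of an explicitly stored last block of size n (~ n**3 / 6)."""
    return n ** 3 / 6.0

base = triangular_ops(100)
print(triangular_ops(125) / base)   # ~1.95: 25% more equations roughly doubles the work
print(triangular_ops(160) / base)   # ~4.10: 60% more equations roughly quadruples it
```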