
Numerical Factorization

The general nature of the block-diagonal-bordered LU factorization algorithms described in this paper follows Crout's algorithm presented in section 2.3. Crout's algorithm is a fan-in algorithm that updates a row/column from previously computed values. During the preprocessing phase, the matrix is ordered into block-diagonal-bordered form, the location of all fillin is determined to permit static data structures, and load balancing is performed on the independent blocks in order to distribute the workload uniformly to all processors when factoring the independent blocks. The data is distributed to processors in such a manner that the data for an independent block and the corresponding portions of the borders are assigned to the same processor. This is illustrated in figure 1(a).
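The sketch below illustrates the block-diagonal-bordered form assumed above. It is a minimal example in Python/NumPy; the two-block partition and the block sizes are illustrative assumptions, not values from the implementation described here.

    import numpy as np

    # Minimal sketch of a block-diagonal-bordered matrix: two independent
    # diagonal blocks, their border strips, and a last block.  Sizes are
    # assumed; the diagonal shifts merely keep the blocks well conditioned.
    rng = np.random.default_rng(0)
    n1, n2, nb = 3, 3, 2                      # block sizes (illustrative)
    n = n1 + n2 + nb
    A = np.zeros((n, n))
    A[:n1, :n1] = rng.random((n1, n1)) + n1 * np.eye(n1)             # block 1
    A[n1:n1+n2, n1:n1+n2] = rng.random((n2, n2)) + n2 * np.eye(n2)   # block 2
    A[-nb:, :n1+n2] = rng.random((nb, n1 + n2))                      # lower border
    A[:n1+n2, -nb:] = rng.random((n1 + n2, nb))                      # upper border
    A[-nb:, -nb:] = rng.random((nb, nb)) + n * np.eye(nb)            # last block
    # Each independent block and its border strips would be assigned to a
    # single processor; the zero off-diagonal blocks are never stored.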

The first step of the parallel algorithm, LU factorization of the independent blocks, is straightforward. Precedence permits the parallel factorization of each independent block and its associated section of the border. Figure 7 depicts the precedence in the calculations for a block in an idealized block-diagonal-bordered sparse matrix. The data in each block and border portion may also be sparse. Figure 8 illustrates the element level dependencies for non-zero elements in both the diagonal block and the associated portion of the border. Note that the normal fan-in precedence relationships illustrated in figure 7 suggest at first glance that data may be required from previous border sections possibly stored on other processors; however, due to the block-diagonal-bordered form of the matrix, any value that would be used for an update in a previous block is equal to zero by definition. Consequently, no interprocessor communication is required in this parallel LU factorization step, because all data is assigned to the processor that uses it to perform updates.
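A sketch of this first step for a single processor's block appears below. Dense SciPy kernels (scipy.linalg.lu and solve_triangular) stand in for the sparse Crout kernel of the paper, and the block and border sizes are assumed for illustration; every operand is local, so the step involves no messages.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    # Factor one independent diagonal block D and the border strips E
    # (upper) and F (lower) that share its rows/columns.  If the full
    # matrix is [[D, E], [F, C]], then D = L_D U_D, the upper border of
    # U is U_E = L_D^{-1} E, and the lower border of L is L_F = F U_D^{-1}.
    rng = np.random.default_rng(1)
    D = rng.random((4, 4)) + 4 * np.eye(4)    # independent block (assumed size)
    E = rng.random((4, 2))                    # upper-border strip
    F = rng.random((2, 4))                    # lower-border strip

    P, L_D, U_D = lu(D)                                   # D = P L_D U_D
    U_E = solve_triangular(L_D, P.T @ E, lower=True)      # border of U
    L_F = solve_triangular(U_D.T, F.T, lower=True).T      # border of L
    # All data above lives on one processor, so no communication occurs.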

 
Figure 7: Precedence Relationship for Independent Blocks 

 
Figure 8: Element Level Dependencies for Independent Matrix Blocks 

The second step in the numerical factorization of a block-diagonal-bordered matrix is to update the last block using data in the borders. For a fan-in algorithm, figure 9 illustrates the required calculations for the factorization of the first column and row of the last block. However, precedence permits the updates from the borders to occur in any order, as long as they occur before the column and row updates are completed. For parallel algorithms, there are definite advantages in calculating all updates from the borders at the same time. For systems where the per-message communications latency dominates the communications costs, a limited number of long messages has significant advantages over numerous short messages. Performing all updates at the same time in order to generate long messages can save significant time that would otherwise be part of the communications overhead in the parallel algorithm. When implemented on processors with low-latency communications support for short messages, it may be advantageous to perform these calculations in a manner that more closely resembles a general fan-in algorithm. It may even be advantageous to calculate the partial sums and store the values until the processor holding a data value that is part of the last block initiates the retrieval of a value to complete the factorization of a column or row.
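Continuing the NumPy illustration, the sketch below shows how a processor can aggregate the contribution of all its local blocks to the last block before any message is sent; the helper name last_block_update and the list-of-factors interface are assumptions made for this sketch, not the paper's implementation.

    import numpy as np

    # Each processor forms the complete update that its local independent
    # blocks contribute to the last block, so one long message (a single
    # reduction) can replace many short ones.  The per-block border
    # factors L_F, U_E are assumed to come from the previous step.
    def last_block_update(border_factors):
        """Sum the contributions L_F @ U_E of every local block."""
        nb = border_factors[0][0].shape[0]
        update = np.zeros((nb, nb))
        for L_F, U_E in border_factors:
            update += L_F @ U_E        # one partial sum per local block
        return update                  # sent as a single long message

    # Example with two local blocks of (assumed) interior sizes 3 and 4:
    rng = np.random.default_rng(2)
    factors = [(rng.random((2, 3)), rng.random((3, 2))),
               (rng.random((2, 4)), rng.random((4, 2)))]
    update = last_block_update(factors)    # one 2x2 message payload

With a message-passing library, these local sums would then feed one global reduction, or be sent directly to the owners of the last block, rather than being transmitted term by term.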

 
Figure 9: Fan-in Data Pattern for the Factorization of the Last Matrix Block 

Figure 10 illustrates the element level data dependencies for updates to elements in the last block by data in the borders. To update the value $a_{ij}$ in the last block, the value $u_{kj}$ in the upper border and the value $l_{ik}$ in the lower border must be multiplied together. All other values in the same column as $u_{kj}$ (within the independent block) and the same row as $l_{ik}$ (within the independent block) can be multiplied together and summed to minimize communications further. Partial sums are calculated for all independent blocks and corresponding borders assigned to a processor, so the number of partial sums becomes a function of the number of processors rather than the number of independent blocks. The partial sums can be further reduced in parallel if they are generally non-zero. Parallel reduction of partial sums is illustrated in figure 11. If the partial sums are sparse relative to the various processors, then each partial sum should be sent directly to the corresponding processor to minimize communications. Eventually the partial sums are subtracted from the elements in the last block. Similar calculations are required for the upper triangular portion of the last block and the borders.
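A minimal sketch of the parallel reduction in figure 11 follows; the list of arrays stands in for per-processor partial sums, an assumption made only so the example runs on one machine.

    import numpy as np

    # Pairwise (tree) reduction of partial-sum matrices held by p
    # processors: each pass halves the number of partial sums, so the
    # reduction completes in about log2(p) communication stages.
    def tree_reduce(partials):
        partials = list(partials)
        while len(partials) > 1:
            nxt = [partials[i] + partials[i + 1]
                   for i in range(0, len(partials) - 1, 2)]
            if len(partials) % 2:      # odd one out forwards its sum
                nxt.append(partials[-1])
            partials = nxt
        return partials[0]

    p, nb = 4, 2
    sums = [k * np.ones((nb, nb)) for k in range(p)]   # per-processor sums
    assert np.allclose(tree_reduce(sums), sum(sums))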

 
Figure 10: Element Level Dependencies for Independent Matrix Borders 

 
Figure 11: Parallel Reduction of Partial Sums 

The third step in the numerical factorization of a block-diagonal-bordered matrix is the factorization of the last block in the matrix. Due to the fillin in the matrix, it is common for the lower right portion of the last block to be nearly dense. By not explicitly storing zero values, implicit sparse matrix storage more than offsets the overhead of storing row and column location indicators and pointers to the next value in a row or column. However, when a portion of a matrix becomes dense due to fillin, that portion of the matrix should be stored and factored in dense form. The most desirable situation would be a small last block that is sufficiently dense to permit explicit storage of the matrix, so that this sub-matrix can be updated efficiently as a dense matrix. In this situation, the best parallel dense matrix factorization algorithm for the particular target architecture could be used to factor the last block. Otherwise, the algorithm would be more complicated, requiring a partitioning of the last-block sub-matrix into a dense block in the lower right-hand corner with the remainder of the sub-matrix stored in sparse form. The size of the last block depends on the structure in the data and on the ordering algorithm used in the preprocessing phase.
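As a concrete (sequential) sketch of this final step, the fragment below factors the last block in dense form after the reduced border updates have been subtracted; the textbook Crout loop stands in for whatever tuned dense kernel the target architecture provides.

    import numpy as np

    # Dense Crout factorization C = L U (unit-diagonal U), applied to the
    # last block after the reduced partial sums have been subtracted.
    def crout_lu(A):
        n = A.shape[0]
        L, U = np.zeros((n, n)), np.eye(n)
        for j in range(n):
            for i in range(j, n):      # column j of L
                L[i, j] = A[i, j] - L[i, :j] @ U[:j, j]
            for k in range(j + 1, n):  # row j of U
                U[j, k] = (A[j, k] - L[j, :j] @ U[:j, k]) / L[j, j]
        return L, U

    C = np.array([[4.0, 2.0], [1.0, 3.0]])   # last block, updates applied
    L, U = crout_lu(C)
    assert np.allclose(L @ U, C)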

For sparse matrices that have nearly dense last blocks, one method of enhancing LU factorization algorithms on either sequential or parallel architectures is to control the size of the last block. It is better to have a small, very dense last block than a larger block that is less dense, yet still sufficiently dense as to require explicit storage of the matrix for efficient factorization. It can be shown that the number of multiplication and subtraction operations to factor either the lower triangular or the upper triangular portion of the last block, assuming that the data will be stored explicitly as a dense matrix, is approximately

    (n^3 - n)/6

where n is the size of the last block. Thus, the number of calculations is O(n^3), and even small increases in the size of the last block cause large increases in the number of calculations. A 25% increase in the size of the last block doubles the number of calculations, and a 60% increase in the number of equations in the last block increases the number of calculations by a factor of four. Consequently, controlling the size of the last block during the ordering step of the preprocessing phase can have significant ramifications on the run time of block-diagonal-bordered sparse matrix factorization algorithms.
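The scaling claim is easy to verify numerically; the short check below uses the operation count above (the choice n = 100 is arbitrary).

    # Growth of the dense-factorization operation count with block size n.
    ops = lambda n: (n**3 - n) / 6
    n = 100
    print(ops(1.25 * n) / ops(n))   # ~1.95: a 25% larger block doubles the work
    print(ops(1.60 * n) / ops(n))   # ~4.10: a 60% larger block quadruples it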





