
Empirical Results --- Sparse Choleski Solver with Machine Generated Test Data

The first step in understanding the performance potential of this block-diagonal-bordered Choleski solver is to examine the parallel algorithm performance with machine generated test data that has perfect load balance for all processors. These data have equal numbers of calculations in the diagonal blocks, eliminating any requirement for load balancing. The pattern of the matrices is a recursive block-diagonal-bordered form, as illustrated in figure 14. This figure depicts the sparsity structure of the test matrix, where non-zero values are black and the remainder of the matrix is comprised of zero values. A bounding box has been placed around the example matrix. This matrix has been used in tests to examine speedup and sources of parallel processing overhead. The block-diagonal-bordered Choleski solver has been run on this machine generated matrix using one to sixteen processors on the Thinking Machines CM-5. The matrix has sixteen separate major diagonal blocks, each with 128 diagonal elements, and eight columns per diagonal block in the lower border; the last diagonal block also has 128 diagonal elements. The size of this matrix was chosen to be similar to the BCSPWR09 test matrix, although it has a substantially higher computational complexity than the BCSPWR09 matrix.

 


Figure 14: Machine Generated Test Matrix
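The layout described above can be sketched programmatically. The following Python snippet is an illustrative sketch, not the authors' test-data generator: block interiors are filled densely for simplicity, and the choice of which eight columns couple each block to the lower border is hypothetical. It builds only the structural non-zero pattern of a matrix with the stated dimensions.

import numpy as np

N_BLOCKS = 16            # independent diagonal blocks
BLOCK_SIZE = 128         # diagonal elements per block
LAST_BLOCK = 128         # diagonal elements in the last (border) block
BORDER_COLS = 8          # columns each diagonal block contributes to the lower border

n = N_BLOCKS * BLOCK_SIZE + LAST_BLOCK
pattern = np.zeros((n, n), dtype=bool)

for b in range(N_BLOCKS):
    lo, hi = b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE
    pattern[lo:hi, lo:hi] = True                  # independent diagonal block
    coupled = np.arange(lo, lo + BORDER_COLS)     # hypothetical choice of coupled columns
    pattern[n - LAST_BLOCK:, coupled] = True      # lower-border rows for this block

pattern[n - LAST_BLOCK:, n - LAST_BLOCK:] = True  # last diagonal block

lower = np.tril(pattern)                          # only the lower triangle is factored
print(n, int(lower.sum()), "structural non-zeros in the lower triangle")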

Performance data have been collected for distinct operations in the factorization and triangular solution of block-diagonal-bordered matrices to examine the speedup of each portion of the algorithm. The data were collected in a manner that does not impact the overall measures of performance. Performance of the multi-processor algorithms is illustrated using graphs plotting relative speedup versus the number of processors (one to sixteen).

Definition --- Relative Speedup: Given a single problem with a sequential algorithm running on one processor and a concurrent algorithm running on p independent processors, relative speedup is defined as

S_p = T_1 / T_p

where T_1 is the time to run the sequential algorithm as a single process and T_p is the time to run the concurrent algorithm on p processors.
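As a concrete illustration of this definition, the short Python sketch below computes relative speedup and the corresponding processor utilization efficiency. The timings are hypothetical and chosen only to reproduce the roughly eleven-fold speedup on sixteen processors reported below.

def relative_speedup(t_sequential, t_parallel):
    """Relative speedup S_p = T_1 / T_p."""
    return t_sequential / t_parallel

# Hypothetical timings in seconds (not measured values): a speedup of about
# eleven on sixteen processors corresponds to roughly 70% utilization efficiency.
t_1, t_16 = 4.0, 0.36
s_16 = relative_speedup(t_1, t_16)
print(s_16, s_16 / 16)    # approximately 11.1 and 0.69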

A graph of speedup calculated from empirical performance data for the test matrix is provided in figure 15. This figure has a family of three curves that show speedup for:

  1. Choleski factorization
  2. forward reduction and backward substitution (triangular solve)
  3. a combination of factorization and a single forward reduction and backward substitution
In general, this speedup graph shows that the block-diagonal-bordered Choleski solver has significant potential for efficient operations, because speedups of over eleven were obtained with sixteen processors. This equates to a processor utilization efficiency of approximately 70%. When examining the runtime timing data for each processor, near-perfect speedup (speedup equal to the number of processors) is not achieved because of overhead that occurs during those portions of the algorithm that use asynchronous pipelined communications: updating the last diagonal block with data from the lower border, and factoring or solving the last diagonal block itself.

 


Figure 15: Speedup for Generated Test Data --- 2, 4, 8, and 16 Processors

As a result of the regularity in this test matrix, it is possible to examine the performance of each portion of the parallel block-diagonal-bordered Choleski factorization algorithm. A graph illustrating speedup calculated from the empirical performance data is presented in figure 16. In this graph, factorization of the diagonal blocks shows perfect speedup, which is not unexpected because there are no communications in this step. However, the measured speedup for the two sections of the Choleski factorization algorithm that require communications is not as impressive, due to communications overhead. As we use more processors, there is more communication and less work for each processor: it takes longer to fill the communications pipelines, and there are fewer calculations per processor to help mitigate the cost of communications. Consequently, there are performance limits to this algorithm. These limits are not a function of a sequential component of the parallel calculations (Amdahl's Law) [7]; rather, they are a function of what percentage of the calculations are perfectly parallel and what percentage of the calculations have asymptotic performance limits due to communications overhead.

 


Figure 16: Speedup for Choleski Factorization Algorithm Components --- 2, 4, 8, and 16 Processors
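The saturation of the communication-bound phases can be illustrated with a toy cost model. This is an assumption made for illustration, not the authors' analysis: the useful work per processor shrinks as 1/p while pipeline fill and per-message overhead grow with p, so the speedup of such a phase levels off and eventually declines.

def phase_speedup(p, work=1.0, fill_per_stage=0.02, per_msg_overhead=0.01):
    """Toy model: speedup of a communication-bound phase on p processors."""
    t_1 = work
    t_p = work / p + fill_per_stage * p + per_msg_overhead * (p - 1)
    return t_1 / t_p

for p in (2, 4, 8, 16):
    print(p, round(phase_speedup(p), 2))   # saturates well below p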

Parallel block-diagonal-bordered Choleski factorization algorithm performance can be viewed as a linear combination of the individual performance of these three portions of the algorithm.

 

S_p = f_D S_p^D + f_B S_p^B + f_L S_p^L

where:

S_p is the p processor speedup for block-diagonal-bordered Choleski factorization,

S_p^D is the p processor speedup for factoring the diagonal blocks,

S_p^B is the p processor speedup for updating the data in the last block using data in the border,

S_p^L is the p processor speedup for factoring the last block,

f_D is the fraction of time spent factoring the diagonal blocks,

f_B is the fraction of time spent updating the data in the last block using data in the border, and

f_L is the fraction of time spent factoring the last block.
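A small Python sketch of this linear-combination estimate follows; the fractions and component speedups are illustrative placeholders, not measured values.

def combined_speedup(fractions, speedups):
    """Linear-combination estimate: S_p = f_D*S_p^D + f_B*S_p^B + f_L*S_p^L."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions of time must sum to one"
    return sum(f * s for f, s in zip(fractions, speedups))

# Illustrative values for a sixteen-processor run: the diagonal-block factorization
# is perfectly parallel, while the border update and last-block factorization are
# limited by pipelined communications.
fractions = (0.70, 0.20, 0.10)     # f_D, f_B, f_L  (placeholders)
speedups  = (16.0, 6.0, 4.0)       # S_p^D, S_p^B, S_p^L  (placeholders)
print(combined_speedup(fractions, speedups))   # -> 12.8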

Our empirical data show that factoring the diagonal blocks achieves nearly perfect speedup because this step has no communications, so the performance of this factorization algorithm on other block-diagonal-bordered sparse matrices can be estimated as a function of the percentage of calculations in the diagonal blocks. Similar analyses can be performed for forward reduction and backward substitution. Parallel performance of those sections of the triangular solution algorithms that require pipelined communications will have less speedup potential than the triangular solution of data in the diagonal blocks. Unfortunately, the calculations in the diagonal blocks for the triangular solution have lower computational complexity than the calculations in the factorization step. Consequently, poorer parallel performance (lower speedup) would be expected in forward reduction and backward substitution than can be obtained in block-diagonal-bordered factorization.

There are two basic forms of communications used in this algorithm.

  1. Pipelined broadcasts for those portions of the algorithm that factor or solve the last diagonal block.
  2. Pipelined, asynchronous, everyone-to-everyone communications in both factorization and forward reduction, when sums of matrix-matrix products or vector-matrix products from data in the lower border are used to update the last diagonal block or the b-vector partition associated with the last diagonal block.
In both instances, the pipelined communications appear to leave the processors data-starved rather than hiding communications behind the calculations. This is not surprising in the forward reduction of the last block, but the phenomenon also occurs in the block factorization step, and experiments with adjusting the block size offered little improvement. While the communications have been made regular in the other communications steps, it appears that communications times are so large relative to the calculations in the same processing step that little benefit can be gained from hiding communications behind calculations. Further research into other asynchronous communications techniques, such as active messages [21], may be able to significantly improve performance in this area.
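The flavor of the everyone-to-everyone border update can be conveyed with a minimal message-passing sketch. This is not the authors' CM-5 implementation: it assumes MPI via mpi4py rather than the CM-5's native communication library, uses random data in place of the actual matrix-matrix products, and uses illustrative buffer sizes. It simply shows the intent of overlapping local calculation with in-flight messages.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

# Each processor owns a partial update of the last diagonal block computed from
# its share of the lower border (random data stands in for the real products).
local_update = np.random.rand(128, 128)
others = [r for r in range(p) if r != rank]

# Start the asynchronous everyone-to-everyone exchange ...
incoming = [np.empty((128, 128)) for _ in others]
requests = [comm.Isend(local_update, dest=d) for d in others]
requests += [comm.Irecv(buf, source=s) for buf, s in zip(incoming, others)]

# ... and keep calculating while messages are in flight; if there is too little
# local work here, the pipeline data-starves and the communication cost is exposed.
local_work = local_update @ local_update

MPI.Request.Waitall(requests)
total_update = local_update + sum(incoming)
if rank == 0:
    print("accumulated update for the last diagonal block:", total_update.shape)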

This analysis has been performed on data that had no load imbalance overhead; additional sources of overhead would degrade the potential performance of the algorithm [7]. We have discussed the effects of communications overhead; nevertheless, other sources of parallel processing overhead are possible, including indexing overhead and load imbalance overhead. Indexing overhead is the extra computational cycles required to set up loops on multiple processors. Care has been exercised in the development of this algorithm to minimize indexing overhead. While it is not possible to eliminate the cost of repeating the process of setting up loops on each processor, the amount of calculation required to keep track of index values on each individual processor can be minimized. Many of the sources of indexing overhead have been accounted for in the preprocessing stage that prepares a matrix for processing by this algorithm. The computational cost of the preprocessing stage is expected to be amortized over multiple uses of the block-diagonal-bordered Choleski solver, so the costs incurred in this stage are not addressed here. Load imbalance overhead is also addressed in the preprocessing stage, although the sources of load imbalance overhead are a function of the interconnections in the real data sets. In the preprocessing phase, we use load balancing to attempt to order the data so as to minimize this source of overhead in our parallel algorithm. We will show in section 8.3 that even with attempts at load balancing, this source of overhead cannot be entirely removed.





