The first step in understanding the performance potential of this block-diagonal-bordered Choleski solver is to examine the parallel algorithm's performance with machine-generated test data that has perfect load balance for all processors. Each diagonal block in this data requires an equal number of calculations, eliminating the need for load balancing. The pattern of the matrices is a recursive block-diagonal-bordered form, as illustrated in figure 14.
This figure depicts the sparsity structure of the test matrix, where non-zero values are shown in black and the remainder of the matrix contains only zero values. A bounding box has been placed around the example matrix. This matrix has been used in tests to examine speedup and the sources of parallel processing overhead. The block-diagonal-bordered Choleski solver has been run on this machine-generated matrix using one to sixteen processors on the Thinking Machines CM-5. The matrix has sixteen separate major diagonal blocks, each with 128 diagonal elements; the last diagonal block also has 128 diagonal elements, and each diagonal block contributes eight columns to the lower border. The size of this matrix was chosen to be similar to that of the BCSPWR09 test matrix, although it has a substantially higher computational complexity than the BCSPWR09 matrix.

Figure 14: Machine Generated Test Matrix
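To make the structure of this test matrix concrete, the following sketch builds the corresponding boolean sparsity pattern in Python with NumPy. The block sizes and border width come from the description above; the dense fill inside each block and the placement of the border non-zeros are assumptions for illustration only, since the actual test matrix has a recursive sparse structure inside the blocks.

```python
import numpy as np

# Sketch of the machine-generated block-diagonal-bordered pattern:
# 16 diagonal blocks of 128 rows, a 128-row last block, and 8 border
# columns contributed by each diagonal block.
N_BLOCKS = 16          # independent diagonal blocks
BLOCK_SIZE = 128       # diagonal elements per block
LAST_BLOCK = 128       # diagonal elements in the last (coupling) block
BORDER_COLS = 8        # border columns per diagonal block

n = N_BLOCKS * BLOCK_SIZE + LAST_BLOCK
pattern = np.zeros((n, n), dtype=bool)

for b in range(N_BLOCKS):
    lo = b * BLOCK_SIZE
    # dense lower triangle inside each diagonal block (placeholder fill;
    # the real matrix is sparse and recursive inside the blocks)
    rows, cols = np.tril_indices(BLOCK_SIZE)
    pattern[lo + rows, lo + cols] = True
    # 8 border columns per block, stored in the lower border rows
    border_rows = np.arange(n - LAST_BLOCK, n)
    border_cols = lo + np.arange(BORDER_COLS)   # assumed column positions
    pattern[np.ix_(border_rows, border_cols)] = True

# dense lower triangle of the last diagonal block
rows, cols = np.tril_indices(LAST_BLOCK)
pattern[n - LAST_BLOCK + rows, n - LAST_BLOCK + cols] = True

print(f"matrix order {n}, lower-triangle nonzeros: {pattern.sum()}")
```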
Performance data have been collected for distinct operations in the factorization and triangular solution of block-diagonal-bordered matrices to examine the speedup of each portion of the algorithm. Data were collected in a manner that does not impact the overall measures of performance. The performance of the multi-processor algorithms is illustrated using graphs that plot relative speedup versus the number of processors (one to sixteen).
Definition --- Relative Speedup. Given a single problem with a sequential algorithm running on one processor and a concurrent algorithm running on $p$ independent processors, relative speedup is defined as
$$S_p = \frac{T_1}{T_p},$$
where $T_1$ is the time to run the sequential algorithm as a single process and $T_p$ is the time to run the concurrent algorithm on $p$ processors.
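As a simple illustration of this definition, the sketch below computes relative speedup from wall-clock timings. The timing values are hypothetical and are not the measured CM-5 results.

```python
def relative_speedup(t_sequential: float, t_parallel_p: float) -> float:
    """Relative speedup S_p = T_1 / T_p for a run on p processors."""
    return t_sequential / t_parallel_p

# Hypothetical wall-clock times (seconds), for illustration only.
t1 = 4.80                                    # sequential algorithm, one processor
timings = {2: 2.50, 4: 1.35, 8: 0.78, 16: 0.49}
for p, tp in timings.items():
    print(f"p = {p:2d}: S_p = {relative_speedup(t1, tp):.2f}")
```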
A graph of speedup calculated from empirical performance data for the test matrix is provided in figure 15. This figure contains a family of three curves showing speedup for the distinct operations measured in the factorization and triangular solution.
Figure 15: Speedup for Generated Test Data --- 2, 4, 8, and 16 Processors
As a result of the regularity in this test matrix, it is possible to examine the performance of each portion of the parallel block-diagonal-bordered Choleski factorization algorithm. A graph illustrating speedup calculated from the empirical performance data is presented in figure 16. In this graph, the curve for factorization of the diagonal blocks shows perfect speedup, which is not unexpected because there are no communications in this step. However, the measured speedup for the two sections of the Choleski factorization algorithm that require communications is not as impressive, due to communications overhead. As more processors are used, there are more communications and less work per processor: it takes longer to fill the communications pipelines, and there are fewer calculations per processor to mitigate the cost of communications. Consequently, there are performance limits to this algorithm. These limits are not a function of a sequential component of the parallel calculations (Amdahl's Law) [7]; rather, they are a function of what percentage of the calculations is perfectly parallel and what percentage has asymptotic performance limits due to communications overhead.
Figure 16: Speedup for Choleski Factorization Algorithm Components --- 2, 4, 8, and 16 Processors
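The asymptotic limits described above can be illustrated with a toy timing model; this model and its constants are assumptions for illustration, not the paper's performance model. The diagonal-block factorization divides its work evenly with no communication, while the communication-bound phases pay a pipeline start-up cost and a per-processor message cost that do not shrink as processors are added, so their speedup saturates.

```python
def component_speedup(work: float, p: int, comm_startup: float = 0.0,
                      comm_per_msg: float = 0.0) -> float:
    """Toy model: T_p = work/p + communication cost that grows with p."""
    t1 = work
    tp = work / p + comm_startup + comm_per_msg * p
    return t1 / tp

# Illustrative, assumed work and communication constants (arbitrary units).
for p in (1, 2, 4, 8, 16):
    s_blocks = component_speedup(1000.0, p)                    # no communication
    s_border = component_speedup(200.0, p, comm_startup=5.0,
                                 comm_per_msg=1.0)             # communication-bound
    print(f"p={p:2d}  diagonal blocks: {s_blocks:5.2f}   border update: {s_border:5.2f}")
```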
Parallel block-diagonal-bordered Choleski factorization algorithm performance can be viewed as a linear combination of the individual performance of these three portions of the algorithm:
$$S_p \;=\; f_{DB}\,S_p^{DB} \;+\; f_{B}\,S_p^{B} \;+\; f_{LB}\,S_p^{LB}$$
where:
$S_p$ is the $p$-processor speedup for block-diagonal-bordered Choleski factorization,
$S_p^{DB}$ is the $p$-processor speedup for factoring the diagonal blocks,
$S_p^{B}$ is the $p$-processor speedup for updating the data in the last block using data in the border,
$S_p^{LB}$ is the $p$-processor speedup for factoring the last block,
$f_{DB}$ is the fraction of time spent factoring the diagonal blocks,
$f_{B}$ is the fraction of time spent updating the data in the last block using data in the border, and
$f_{LB}$ is the fraction of time spent factoring the last block.
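The linear combination above can be evaluated directly once the component speedups and time fractions are known (the combination is exact when the fractions are taken over the multi-processor run time). The sketch below uses hypothetical values for a sixteen-processor run; it is illustrative only and does not reproduce the measured data.

```python
def combined_speedup(fractions, speedups):
    """Overall speedup as the linear combination of component speedups,
    with the fractions taken over the multi-processor run time."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    return sum(f * s for f, s in zip(fractions, speedups))

# Hypothetical 16-processor values: the diagonal blocks scale perfectly,
# while the border update and last-block factorization do not.
s_diag, s_border, s_last = 16.0, 5.0, 3.0
f_diag, f_border, f_last = 0.70, 0.15, 0.15
print(combined_speedup((f_diag, f_border, f_last), (s_diag, s_border, s_last)))
```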
There are two basic forms of communications used in this algorithm.
This analysis has been performed on data that had no load imbalance overhead; additional sources of overhead would degrade the potential performance of the algorithm [7]. We have discussed the effects of communications overhead; nevertheless, other sources of parallel processing overhead are possible, including indexing overhead and load imbalance overhead. Indexing overhead is the extra computational cycles required to set up loops on multiple processors. Care has been exercised in the development of this algorithm to minimize indexing overhead. While it is not possible to eliminate the cost of repeating the process of setting up loops on each processor, the amount of calculation required to keep track of index values on each of the processors can be minimized. Many of the sources of indexing overhead have been accounted for in the preprocessing stage that prepares a matrix for processing by this algorithm. The computational cost of the preprocessing stage is expected to be amortized over multiple uses of the block-diagonal-bordered Choleski solver, so the costs incurred in this stage are not addressed here.

Load imbalance overhead is also addressed in the preprocessing stage, although the sources of load imbalance overhead are a function of the interconnections in the real data sets. In the preprocessing phase, we use load balancing to attempt to order the data so as to minimize this source of overhead in our parallel algorithm. We will show in section 8.3 that even with attempts at load balancing, this source of overhead cannot be entirely removed.
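As an illustration of the kind of load balancing performed in the preprocessing stage, the sketch below assigns diagonal blocks to processors with a greedy largest-first heuristic. The per-block operation counts and the heuristic itself are assumptions for illustration; this is a stand-in, not the preprocessing algorithm used here.

```python
import heapq

def balance_blocks(block_costs, n_procs):
    """Greedy (largest-first) assignment of diagonal blocks to processors,
    a simple stand-in for load balancing in the preprocessing stage."""
    heap = [(0.0, p) for p in range(n_procs)]   # (assigned cost, processor id)
    heapq.heapify(heap)
    assignment = {}
    for block, cost in sorted(enumerate(block_costs), key=lambda bc: -bc[1]):
        load, proc = heapq.heappop(heap)        # least-loaded processor so far
        assignment[block] = proc
        heapq.heappush(heap, (load + cost, proc))
    return assignment

# Uniform costs (as in the machine-generated matrix) balance perfectly;
# irregular costs (as in real power-network matrices) generally do not.
print(balance_blocks([100.0] * 16, 4))
```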