Preliminary empirical data for the block-diagonal-bordered sparse matrix solver has been collected on the Thinking Machines CM-5. Although the parallel implementation includes instrumentation messages that add communications overhead, and the last block is factored sequentially on the host processor, the speedup of the parallel implementation has been substantial for large sparse matrices. The test data was machine generated with a regular pattern, although the software implementation does not exploit the data regularities --- the matrix was solved as if it were irregular. A regular data matrix was used because of the ease with which the locations of all fillin values can be determined. To eliminate the need for load balancing, the number of independent blocks in each test matrix is a multiple of the number of available processors on the Northeast Parallel Architectures Center (NPAC) CM-5.
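To make the processor-assignment point concrete, the following is a minimal sketch, not taken from the paper's implementation, of how choosing the block count as a multiple of the processor count yields a uniform block-to-processor mapping with no explicit load-balancing step; the function name and parameters are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): when the number of independent
# blocks is a multiple of the processor count, a simple block-cyclic assignment
# gives every processor the same number of blocks.
def assign_blocks(num_blocks, num_processors):
    assert num_blocks % num_processors == 0, "test matrices are sized so this holds"
    assignment = {p: [] for p in range(num_processors)}
    for block in range(num_blocks):
        assignment[block % num_processors].append(block)
    return assignment

# Example: 32 independent blocks on a 32-node partition -> one block per processor;
# 64 blocks on the same partition -> two blocks per processor.
print(assign_blocks(32, 32)[0])   # [0]
print(assign_blocks(64, 32)[0])   # [0, 32]
```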
An example of the test matrices is presented in figure 19. This figure depicts the sparsity structure of a matrix, where non-zero matrix elements can be interpreted either as edges in a graph or as non-zero coefficients in sparse linear equations. Non-zero values are black, zero values are light gray, and fillin values are medium gray. This matrix is the smallest test matrix for which performance data is presented in this paper; it has 32 independent blocks, each with four sub-blocks and four separate sets of coupling equations in the sparse border. Matrices with larger numbers of equations are generated by increasing the number of independent blocks and the corresponding coupling equations in the borders. The choice of this test matrix form was motivated partly by the simplicity with which regular matrices can be generated rapidly for code verification and preliminary benchmarking, and partly by the desire to exercise the algorithms described in the previous section on controlled data, so that the performance of each portion of the sequential and parallel block-diagonal-bordered sparse matrix algorithms can be investigated. The use of these controlled matrices limits all load imbalance overhead in subsequent benchmarking; as a result, processing bottlenecks are not masked by artifacts in the data. When working with irregular sparse matrices, it will be the task of the load balancing step to minimize any load imbalance overhead. Ultimately, the performance of the algorithms will depend on the inherent structure of real, irregular sparse matrices. Moreover, we are examining matrix partitioning and load balancing algorithms that will yield better performance than nested dissection algorithms on irregular data.
Figure 19: Sample Test Matrix With 32 Independent Blocks
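For illustration, the sketch below constructs a boolean sparsity pattern with the block-diagonal-bordered form described above: independent blocks along the diagonal, sparse border rows and columns carrying the coupling terms, and a final diagonal block for the coupling equations. The generator, its parameters, and the dense stand-ins for the blocks and borders are assumptions made for this sketch; they do not reproduce the sub-block detail or border sparsity of the actual test matrices.

```python
import numpy as np

# Hypothetical sketch (assumed structure, not the authors' matrix generator):
# build the boolean sparsity pattern of a regular block-diagonal-bordered
# matrix with `n_blocks` independent diagonal blocks of order `block_size`
# and a border of `border_size` coupling equations that also forms the
# last diagonal block.
def bdb_pattern(n_blocks=32, block_size=16, border_size=32):
    n = n_blocks * block_size + border_size
    pattern = np.zeros((n, n), dtype=bool)
    for b in range(n_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        pattern[lo:hi, lo:hi] = True          # dense stand-in for an independent block
        pattern[lo:hi, -border_size:] = True  # border columns (coupling terms)
        pattern[-border_size:, lo:hi] = True  # border rows (coupling terms)
    pattern[-border_size:, -border_size:] = True  # last block: coupling equations
    return pattern

mask = bdb_pattern()
print(mask.shape, mask.sum())  # matrix order and count of structurally non-zero entries
```

Scaling the test problem in this sketch corresponds to increasing `n_blocks` while leaving the block structure unchanged, mirroring how the larger test matrices are generated from the smallest one.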