
Sequential Code Optimization

In this chapter, we describe the implementations of the parallel block-diagonal-bordered sparse matrix solvers that we have developed for the Thinking Machines CM-5 distributed-memory multiprocessor. All implementations have been developed with special concern for sequential code optimization, in addition to optimization of the more complex parallel code. Paramount to sequential code optimization is matching the data structures to the algorithm in order to maximize cache hits. This can be accomplished by performing operations, whenever possible, on vectors of consecutively stored data. The iterative Gauss-Seidel method makes this easy: the algorithm is essentially a matrix-vector product composed of vector-vector products, one for each row as the new value of the corresponding unknown is calculated. The sparse implementation of Gauss-Seidel therefore requires the calculation of a sparse matrix-dense vector product. The figure below illustrates the optimal way to perform this matrix-vector product, with the (sparse) matrix stored in row-major form. In this figure, long horizontal lines depict sparse row vectors and the vertical line represents the dense vector x.

 
Figure: Optimal Row-Major Data Storage for Gauss-Seidel Algorithms 
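As a concrete illustration of this access pattern, the following is a minimal sketch of one Gauss-Seidel sweep in C, assuming the sparse matrix is held in a compressed sparse row (CSR) layout with its diagonal stored separately; the array and function names are illustrative rather than taken from the actual solver.

```c
/* Sketch of one Gauss-Seidel sweep with the sparse matrix in compressed
 * sparse row (CSR) form, so each row's nonzeros are a consecutively
 * stored vector.  Names (row_ptr, col_ind, val, diag) are illustrative
 * assumptions, not the data structures of the original code. */
#include <stddef.h>

void gauss_seidel_sweep(size_t n,
                        const size_t *row_ptr,   /* n+1 row start offsets        */
                        const size_t *col_ind,   /* column index of each nonzero */
                        const double *val,       /* off-diagonal nonzero values  */
                        const double *diag,      /* diagonal entries a_ii        */
                        const double *b,         /* right-hand side              */
                        double       *x)         /* solution, updated in place   */
{
    for (size_t i = 0; i < n; i++) {
        double sum = b[i];
        /* Sparse row of A times the dense vector x: the row data are read
         * with unit stride, the cache-friendly pattern described above. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum -= val[k] * x[col_ind[k]];
        x[i] = sum / diag[i];
    }
}
```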

Algorithms for direct linear solvers have much more complicated access patterns. For dense direct methods, conventional two-dimensional matrix storage leaves the data accessible as contiguous vectors in only one direction, and unless the software is written to account for strides through the matrix, cache misses can degrade performance during dense factorization. The figure below illustrates four possible dense factorization implementations, two each for column-major storage and row-major storage [30]. In this figure, long lines depict dense vectors and dots represent scalar accesses.

 
Figure: Optimal Data Storage for Dense Factorization 
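To make the loop-ordering issue concrete, the sketch below shows one of the column-oriented variants (a right-looking kji ordering) with the dense matrix laid out column-major, so the innermost loops stride through consecutive memory. The layout choice and the omission of pivoting are assumptions made to keep the example short; the other variants differ only in how the three loops are nested.

```c
/* Sketch of a right-looking (kji) LU factorization with the dense matrix
 * held column-major in a one-dimensional array, so the innermost loops
 * access consecutive memory locations.  No pivoting is performed; L is
 * stored in the strictly lower triangle (unit diagonal implied) and U in
 * the upper triangle. */
#include <stddef.h>

#define A(i, j) a[(size_t)(j) * n + (i)]   /* column-major indexing */

void lu_kji(size_t n, double *a)
{
    for (size_t k = 0; k < n; k++) {
        /* Scale column k below the diagonal: unit-stride access. */
        for (size_t i = k + 1; i < n; i++)
            A(i, k) /= A(k, k);
        /* Rank-one update of the trailing submatrix, column by column,
         * so each inner loop again walks down a contiguous column. */
        for (size_t j = k + 1; j < n; j++)
            for (size_t i = k + 1; i < n; i++)
                A(i, j) -= A(i, k) * A(k, j);
    }
}
```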

For sparse direct linear solvers, we store the sparse matrices implicitly, which provides rich opportunities for improving sequential cache performance. It is simple to store the diagonal, lower triangular, and upper triangular portions of the matrix in separate data structures and to use sparse vector-sparse vector products throughout the factorization calculations. This is illustrated in the figure below for a factorization algorithm with the lower triangular matrix data stored in row-major order and the upper triangular matrix data stored in column-major order. Care must be taken to use forward reduction and backward substitution algorithms that are compatible with the data structures chosen to optimize performance. This is straightforward, because the loops in the triangular solutions can be nested in either order.

 
Figure: Optimal Data Storage for Sparse Factorization --- Doolittle's Algorithm 
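The dense sketch of Doolittle's algorithm below shows why this mixed layout pays off: each inner product pairs a contiguously stored row of L with a contiguously stored column of U. The actual solver replaces these dense inner products with sparse vector-sparse vector products over implicitly stored data; the array names and the omission of pivoting here are assumptions for illustration only.

```c
/* Dense sketch of Doolittle's algorithm with L stored row-major and U
 * stored column-major, so every inner product below runs over two
 * contiguously stored vectors.  L has an implied unit diagonal; no
 * pivoting is performed. */
#include <stddef.h>

void doolittle(size_t n, const double *a /* row-major A        */,
               double *l               /* row-major, strictly lower */,
               double *u               /* column-major, upper       */)
{
    for (size_t i = 0; i < n; i++) {
        /* Row i of U: dot product of L row i with U column j. */
        for (size_t j = i; j < n; j++) {
            double sum = a[i * n + j];
            for (size_t k = 0; k < i; k++)
                sum -= l[i * n + k] * u[j * n + k];
            u[j * n + i] = sum;
        }
        /* Column i of L: dot product of L row j with U column i. */
        for (size_t j = i + 1; j < n; j++) {
            double sum = a[j * n + i];
            for (size_t k = 0; k < i; k++)
                sum -= l[j * n + k] * u[i * n + k];
            l[j * n + i] = sum / u[i * n + i];
        }
    }
}
```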

Now we examine methods to optimize parallel direct and parallel iterative solver implementations.



