next up previous
Next: The Preprocessing Phase Up: The Preprocessing Phase Previous: Pseudo-Factorization

Load Balancing

The load balancing step of the preprocessing phase can be performed with a simple pigeon-hole type algorithm that uses metrics based on the numbers of floating point operations determined in the pseudo-factorization step. There are three distinct steps in the block-diagonal-bordered matrix solver:

1. factoring the independent diagonal blocks and the corresponding portions of the lower border,
2. updating the last diagonal block with sparse matrix-sparse matrix products computed from data in the borders, and
3. factoring the last diagonal block.

Load balancing, as implemented for factorization of the diagonal blocks and the lower border, emphasizes a uniform distribution of the processing workload across the first two steps. The second factorization step, updating the last block using data in the borders, requires that the results of sparse matrix-sparse matrix products be sent to the processor that holds the data for an element in the last block. The independent nature of the calculations in the diagonal blocks and the border permits a processor to start the second step as soon as it has completed factoring its independent blocks. Consequently, the sum of all calculations in the diagonal blocks and the corresponding border sub-matrices can be used as the metric when load balancing. The parallel calculations in the last diagonal block are performed using a pipelined blocked kji-saxpy LU algorithm that does not require explicit load balancing [61].

The metrics we use consider only the number of floating point operations and do not account for indexing overhead, which can be rather extensive when sparse matrices are stored in an implicit form. The data structure used in our solver maintains explicit links between non-zero values in a column and stores the data in each row as a sparse vector. This data structure should minimize indexing overhead, at the cost of additional memory to store the sparse matrix when compared with other sparse data storage methods [13]. The implementation of the parallel block-diagonal-bordered LU solver is discussed in greater detail in a later chapter.
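The storage scheme described above can be illustrated with a minimal sketch. The class and method names below are hypothetical, not taken from the solver itself; the sketch only shows the two structural features named in the text: explicit links between the non-zeros of a column, and each row held as a sparse vector.

```python
class NonZero:
    """One non-zero entry, with an explicit link to another
    non-zero in the same column (the column forms a linked chain)."""
    __slots__ = ("row", "col", "value", "next_in_col")

    def __init__(self, row, col, value):
        self.row = row
        self.col = col
        self.value = value
        self.next_in_col = None


class SparseMatrix:
    """Sketch of a sparse matrix with linked column chains and
    rows stored as sparse vectors (lists of NonZero entries)."""

    def __init__(self, n):
        self.n = n
        self.col_head = [None] * n          # first non-zero in each column chain
        self.rows = [[] for _ in range(n)]  # each row: a sparse vector of NonZero

    def insert(self, row, col, value):
        nz = NonZero(row, col, value)
        # Link the entry into the column's chain (head insertion here;
        # a real implementation would keep the chain ordered by row).
        nz.next_in_col = self.col_head[col]
        self.col_head[col] = nz
        # Append to the row's sparse vector.
        self.rows[row].append(nz)
        return nz
```

Following a column's `next_in_col` links visits all non-zeros in that column without scanning, which is what keeps indexing overhead low relative to fully implicit formats.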

The load-balancing algorithm is a simple greedy assignment algorithm that distributes objective function values to multiple pigeon-holes in a manner that minimizes the disparity between the sums of the objective function values in each pigeon-hole. This is performed in three steps. First, the objective function values for the independent blocks are sorted into descending order. Second, the p greatest values are distributed one to each of p pigeon-holes, where p is the number of processors in the distributed-memory multicomputer. Third, the remaining objective function values are selected in descending order, each being placed in the pigeon-hole with the least aggregate workload. This straightforward algorithm minimizes the disparity in aggregate workload between processors; nevertheless, the actual disparity in processor workload depends on the irregular sparse matrix structure. The algorithm works best when there are minimal disparities in the workloads of the independent blocks, or when there are significantly more independent blocks than processors, since the workloads of several small blocks can then sum to match the workload of a single block with more computational workload.
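The greedy pigeon-hole assignment above can be sketched as follows. The function name and workload values are illustrative only; the workloads would in practice be the floating point operation counts produced by the pseudo-factorization step. Seeding a min-heap with one empty pigeon-hole per processor makes the first p pops assign the p greatest values one per processor, exactly as the three-step description requires.

```python
import heapq

def balance_blocks(block_workloads, num_procs):
    """Greedy pigeon-hole load balancing: sort block workloads in
    descending order, then repeatedly assign the next block to the
    pigeon-hole (processor) with the least aggregate workload."""
    # Step 1: order block indices by descending objective function value.
    order = sorted(range(len(block_workloads)),
                   key=lambda b: block_workloads[b], reverse=True)
    # One (aggregate workload, processor id) entry per pigeon-hole.
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # Steps 2 and 3: the first num_procs iterations seed the empty
    # pigeon-holes; the rest always pick the least-loaded one.
    for b in order:
        load, p = heapq.heappop(heap)
        assignment[p].append(b)
        heapq.heappush(heap, (load + block_workloads[b], p))
    return assignment
```

For example, balancing blocks with workloads [10, 9, 8, 2, 2, 2, 2, 2] over two processors yields aggregate workloads of 18 and 19, illustrating how several small blocks sum to match one large block.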




David P. Koester
Sun Oct 22 17:27:14 EDT 1995