The load-balancing step of the preprocessing phase can be performed with a simple pigeon-hole-type algorithm that uses one of several metrics based on the numbers of calculations determined in the pseudo-factorization step.
When load balancing for the triangular solutions, we chose as the metric the number of non-zero elements (including fill-in) in all rows except those in the last diagonal block. This count closely tracks the total number of calculations, although the forward reduction of the last block requires processor synchronization, so there is some room for variation in this load-balancing metric.
Regardless of which metric is used for load balancing, one important point should be noted: these metrics do not account for indexing overhead, which can be substantial when sparse matrices are stored in an implicit form. The data structure used in our solver maintains explicit links between non-zero values in a column and stores the data in each row as a sparse vector. This data structure should minimize indexing overhead, at the cost of additional memory to store the sparse matrix compared with other sparse storage methods [5]. The implementation of the parallel block-diagonal-bordered Choleski solver is discussed in greater detail in section 7.
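To make the storage scheme and the load-balancing metric concrete, the sketch below shows one possible realization in C. The struct layout, the field names, and the routine block_workload are illustrative assumptions only; the actual data structure is described in section 7.

    /* Illustrative sketch: each row is stored as a packed sparse vector,
     * and each non-zero also carries an explicit link to the next
     * non-zero below it in the same column.  Names are hypothetical. */
    typedef struct nonzero {
        int             col;       /* column index of this non-zero       */
        double          value;     /* numerical value (fill-in included)  */
        struct nonzero *col_next;  /* next non-zero below in this column  */
    } nonzero;

    typedef struct {
        int      nnz;       /* number of non-zeros stored in this row     */
        nonzero *entries;   /* the row as a packed sparse vector          */
    } sparse_row;

    /* Load-balancing metric for one independent block: the number of
     * non-zeros (fill-in included) in the rows owned by that block.
     * Rows belonging to the last (border) diagonal block are excluded
     * by simply not passing them in. */
    static long block_workload(const sparse_row *rows,
                               int first_row, int last_row)
    {
        long work = 0;
        for (int r = first_row; r <= last_row; ++r)
            work += rows[r].nnz;
        return work;
    }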
The load-balancing algorithm is a simple greedy assignment algorithm that distributes objective-function values to multiple pigeon-holes in a manner that minimizes the disparity between the sums of the objective-function values in each pigeon-hole. This is performed in a three-step process. First, the objective-function values for the independent blocks are placed into descending order. Second, the P greatest values are distributed to a separate pigeon-hole for each processor, where P is the number of processors in a distributed-memory multicomputer. Then the remaining objective-function values are selected in descending order and each is placed in the pigeon-hole with the least aggregate workload. This algorithm is straightforward and produces a near-optimal distribution of workload to processors; the actual disparity in processor workload, however, depends on the irregular sparse matrix structure. The algorithm works best when there are minimal disparities in the workloads of the independent blocks, or when there are significantly more independent blocks than processors, because the workloads of several small blocks can then sum to equal the workload of a single block with greater computational workload.
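A minimal sketch of this greedy pigeon-hole assignment, assuming each block's workload has already been computed (for example with the metric above), is given below. The function and type names, the use of qsort, and the linear scan for the least-loaded pigeon-hole are illustrative choices, not the solver's actual implementation.

    #include <stdlib.h>

    /* One independent block and its objective-function value. */
    typedef struct {
        int  block;     /* block index                                 */
        long workload;  /* e.g. non-zeros including fill-in (metric)   */
    } block_work;

    /* qsort comparator: descending order of workload. */
    static int by_descending_workload(const void *a, const void *b)
    {
        long wa = ((const block_work *)a)->workload;
        long wb = ((const block_work *)b)->workload;
        return (wa < wb) - (wa > wb);
    }

    /* Greedy pigeon-hole assignment: sort blocks by descending workload
     * and repeatedly give the next block to the processor with the
     * smallest aggregate workload.  assign[i] receives the processor
     * assigned to block i; aggregate[p] accumulates processor p's load. */
    void load_balance(block_work *blocks, int nblocks, int nprocs,
                      int *assign, long *aggregate)
    {
        qsort(blocks, nblocks, sizeof *blocks, by_descending_workload);

        for (int p = 0; p < nprocs; ++p)
            aggregate[p] = 0;

        for (int i = 0; i < nblocks; ++i) {
            int best = 0;
            for (int p = 1; p < nprocs; ++p)   /* least-loaded pigeon-hole */
                if (aggregate[p] < aggregate[best])
                    best = p;
            assign[blocks[i].block] = best;
            aggregate[best] += blocks[i].workload;
        }
    }

Note that the single loop collapses the second and third steps of the description: because every pigeon-hole starts empty, the P largest blocks necessarily land in P distinct pigeon-holes before any pigeon-hole receives a second block.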
The pseudo-factorization step incurs significantly more computational cost than the symbolic factorization used in previous sparse matrix solvers, the ordering phase is more costly than minimum-degree ordering alone, and explicit load balancing is a step that earlier solvers generally do not perform. Block-diagonal-bordered sparse matrix solvers therefore have significantly more overhead in the preprocessing phase, so the use of this technique will be limited to problems with static matrix structures, for which the ordered matrix and the load-balanced processor assignments can be reused many times and the cost of the preprocessing phase amortized over numerous matrix factorizations.
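As a rough illustration of this amortization argument (the symbols T_pre, T_fact, and n are introduced here only for illustration): if T_pre is the one-time cost of ordering, pseudo-factorization, and load balancing, and T_fact is the cost of a single numerical factorization, then reusing the ordering and processor assignments for n factorizations gives an average cost per factorization of approximately T_fact + T_pre / n. The preprocessing overhead therefore becomes negligible only when the same matrix structure is factored many times with different numerical values.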