As stated above, the metric for performing load balancing or for comparing the performance of ordering techniques must be based on the actual workload required by the processors in a distributed-memory multi-computer. Consequently, more information is required than just the locations of fillin as in previous work that used symbolic factorization to identify fillin for static data structures [16,20,38].
To accomplish the two-fold requirement for both identifying the
location of fillin and determining the amount of calculations in each
independent block, we utilize a pseudo-factorization step as part of
the preprocessing phase. Pseudo-factorization is merely a replication
of the numerical factorization process without actually performing the
calculations. Counters are used to tally the numbers of calculations
to factor the independent data blocks and the numbers of calculations
to update the last block using data from the borders. In addition,
while performing the pseudo-factorization step, it is also simple to
keep track of the number of operations that would be required when
performing the triangular solutions. Collecting these data makes it
possible to order the matrix to optimize either the number of
calculations per processor in the factorization step or the number of
calculations in the triangular solution steps. The factored matrix is
often reused many times, as in so-called dishonest iterative schemes
that do not refactor the matrix at every iteration; as a result, some
applications may require special attention to maximizing efficiency
in the forward-reduction and backward-substitution steps.
There is no way to avoid the computational expense of this preprocessing step, because the computational workload in factorization is not correlated with the number of equations in an independent block. The number of calculations when factoring an independent sub-block is a function of the number and location of non-zero matrix elements in that block --- not necessarily the number of equations in the block. Efficient parallel sparse matrix solvers require that any disparities in processor workloads be minimized in order to minimize load imbalance overhead, and consequently, to maximize processor utilization.
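A minimal sketch of such a pseudo-factorization, assuming a
structurally symmetric matrix and a right-looking elimination order
(all names here are illustrative, not from the paper): it replicates
the elimination pattern on the sparsity structure alone, recording
fill-in locations and tallying the multiply-adds the real
factorization would perform.

```python
def pseudo_factor(n, nonzeros):
    """Symbolically eliminate an n x n structurally symmetric matrix.

    nonzeros: iterable of (i, j) positions of structural nonzeros.
    Returns (fill, ops): the set of fill-in locations (i, j) with
    i <= j, and the count of multiply-add operations the numerical
    factorization would require -- without doing any arithmetic.
    """
    struct = {i: set() for i in range(n)}
    for i, j in nonzeros:
        struct[i].add(j)
        struct[j].add(i)          # assume structural symmetry

    fill = set()
    ops = 0
    for k in range(n):
        # rows below the pivot that are structurally nonzero in column k
        active = sorted(j for j in struct[k] if j > k)
        for idx, i in enumerate(active):
            for j in active[idx:]:
                ops += 1          # one multiply-add in the real factorization
                if j not in struct[i]:
                    fill.add((i, j))      # fill-in location identified
                    struct[i].add(j)
                    struct[j].add(i)
    return fill, ops


# Two 4x4 "arrow" patterns with the same number of equations and the
# same number of nonzeros, differing only in ordering:
diag = {(i, i) for i in range(4)}
arrow_down = diag | {(0, 3), (1, 3), (2, 3)}  # nonzeros in last row/col
arrow_up = diag | {(0, 1), (0, 2), (0, 3)}    # nonzeros in first row/col

fill_d, ops_d = pseudo_factor(4, arrow_down)  # no fill-in, 3 ops
fill_u, ops_u = pseudo_factor(4, arrow_up)    # 3 fill-ins, 10 ops
```

The two orderings illustrate the point above: both blocks have four
equations and seven structural nonzeros, yet the operation counts the
pseudo-factorization reports differ markedly, so block size alone
cannot predict per-processor workload.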