We present graphs of relative speedup calculated from empirical performance data in figure 7.6 for the three parallel direct solver implementations. The graphs for double precision and complex variate LU factorization each contain two families of speedup curves for the five power systems networks examined in this research: one family for factorization and one for the triangular solution. The graph for Choleski factorization contains three families of curves: factorization, forward reduction, and backward substitution. Each curve plots relative speedup for 2, 4, 8, 16, and 32 processors.
Figure 7.6: Relative Speedup --- Parallel Direct Solvers
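Throughout this section, relative speedup is computed from measured execution times of the parallel code itself (the single-processor baseline is discussed later in this section). A minimal statement of the metric, with notation ($S_{\mathrm{rel}}$, $T_p$) introduced here for illustration rather than taken from the thesis, is
\[
  S_{\mathrm{rel}}(p) = \frac{T_1}{T_p},
\]
where $T_p$ is the measured execution time of the parallel solver on $p$ processors and $T_1$ is the time of the same parallel code run on a single processor.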
The graph for complex variate LU factorization in figure 7.6 illustrates that parallel performance of the complex LU factorization algorithm can be as much as 18 on 32 processors for the BCSPWR10 power systems network, while factorization performance for the other data sets ranges from eight to nearly eleven. The BCSPWR10 matrix requires the most calculations, and its single-processor timing data shows a greater relative increase in time from double precision to complex variate LU factorization than the other power systems networks. Because speedup is measured against the single-processor time, a significant increase in that time produces a corresponding increase in speedup. We believe that the unusually good performance of the parallel solver for the BCSPWR10 data is a result of caching effects: when the program is run on one or two processors, there is too much data on each processor to fit into the fast-access cache memory. When more processors are used, there is less data per processor, the entire portion of the matrix assigned to each processor can fit concurrently into the fast-access cache, and as a result the program runs considerably faster.
While there is a noticeable improvement in speedup performance for complex variate LU factorization of the BCSPWR10 matrix, complex triangular solutions for this matrix do not exhibit as significant an increase in performance over the other power systems network matrices. Complex variate triangular solutions provide speedups ranging from a low of four to a high of eight.
The graph for double precision LU factorization in figure 7.6 illustrates that parallel performance of the double precision LU factorization algorithm can be nearly 10 on 32 processors for the BCSPWR10 power systems network. Factorization performance falls between seven and eight for the other four networks. Likewise, double precision triangular solutions provide speedups ranging from a low of three to slightly greater than four.
Parallel Choleski factorization yields speedups that are less than those of the corresponding LU algorithms. Empirical relative speedup varies between four and five on 32 processors, as illustrated in the graph for Choleski factorization in figure 7.6. This graph also presents empirical speedup data for forward reduction and backward substitution; because the implementations of these two triangular solution algorithms differ significantly, empirical data are presented for both.
Backward substitution is the simplest algorithm and has the lowest communications overhead, limited to the broadcast of recently calculated values in the solution vector as the backward substitution proceeds through the transpose of the Choleski factor. Empirical relative speedup ranges from 2.5 to a high of 3.5. These speedups are only slightly less than the speedups for backward substitution associated with LU factorization. Meanwhile, essentially no speedup has been measured for the forward reduction algorithm, due primarily to greater communications overhead in this implementation than in either LU forward reduction or Choleski backward substitution. Communications are required to update the last diagonal block using data in the borders, and there are additional communications for reducing the last diagonal block. These additional communications occur because the data distribution forces interprocessor communications of partial updates when calculating values in the intermediate solution vector, rather than broadcasting previously calculated values as in the LU-based forward reduction. Interprocessor communications increase from being proportional to the number of rows/columns in the last diagonal block to being proportional to the number of non-zeros in the last diagonal block. After minimum degree ordering of the last diagonal block, the number of non-zeros may be significantly greater than the number of rows or columns. Due to the characteristics of Choleski factorization, it is inevitable that either forward reduction or backward substitution would have to deal with the problem of increased interprocessor communications [29].
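To make the two communication patterns concrete, the following sketch contrasts a row-oriented forward reduction, in which each newly computed solution value would be broadcast, with a column-oriented forward reduction, in which partial updates would have to be sent to the processors owning later rows. This is an illustrative dense example written for this discussion, not the thesis implementation; the matrix, right-hand side, and variable names are hypothetical, and the parallel communication points appear only in comments.
\begin{verbatim}
/* Illustrative sketch (not the thesis implementation): two organizations
 * of forward reduction L*y = b on a small dense lower-triangular example. */
#include <stdio.h>

#define N 3

int main(void)
{
    double L[N][N] = { {2.0, 0.0, 0.0},
                       {1.0, 3.0, 0.0},
                       {4.0, 5.0, 6.0} };
    double b[N] = {2.0, 7.0, 32.0};
    double y_row[N], y_col[N];
    int i, j;

    /* Row-oriented organization: y[i] consumes previously computed y[j].
     * In a distributed solver, each finished y[i] would be broadcast,
     * so communication grows with the number of rows. */
    for (i = 0; i < N; i++) {
        double sum = b[i];
        for (j = 0; j < i; j++)
            sum -= L[i][j] * y_row[j];      /* uses broadcast value y[j] */
        y_row[i] = sum / L[i][i];
    }

    /* Column-oriented organization: once y[j] is known, its contribution
     * is subtracted from all later right-hand-side entries. In a
     * distributed solver, each of these partial updates would have to be
     * communicated, so communication grows with the number of non-zeros. */
    for (i = 0; i < N; i++)
        y_col[i] = b[i];
    for (j = 0; j < N; j++) {
        y_col[j] /= L[j][j];
        for (i = j + 1; i < N; i++)
            y_col[i] -= L[i][j] * y_col[j]; /* partial update sent to row i */
    }

    for (i = 0; i < N; i++)
        printf("y_row[%d] = %g   y_col[%d] = %g\n", i, y_row[i], i, y_col[i]);
    return 0;
}
\end{verbatim}
Both loops produce the same solution vector; the difference is what would cross processor boundaries: one value per row of the last diagonal block in the first organization, versus one partial update per non-zero in the second, which is the growth in communications volume described above.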
The analysis in this section has used relative speedup, where the sequential execution time was measured with a version of the parallel algorithm running on a single processor. The block-diagonal-bordered form matrices have additional fillin and additional calculations when compared to a matrix that has been ordered as a single large matrix. The summary statistics tabulated in the appendix for the power systems networks used in this analysis show that there are between 15% and 20% additional non-zeros due to fillin in the block-diagonal-bordered matrices when compared to general minimum degree ordered matrices.
Differences in run times would be expected to be an even greater percentage, because the computational complexity of factorization grows faster than linearly in the number of non-zeros. However, when preliminary tests were run with the NiMo-OPS matrix on single-processor Sun Microsystems SPARCstations, the difference in performance ranged from 5% to 10%, depending on processor type. The general sparse sequential direct solver was a fan-out algorithm that requires access to memory locations throughout the remaining matrix as rank 1 updates are performed. We hypothesize that the performance of the general sequential sparse algorithm was less than expected due to cache access difficulties: while the general sparse algorithm requires access to the entire matrix as updates are performed, the sequential block-diagonal-bordered direct solver performs operations on limited portions of the matrix at any one time.
Due to the limited difference in performance between the sparse block-diagonal-bordered solver and the general sparse fan-out solver, sequential timings were taken from single-processor runs of the parallel solver. Relative speedups reported here should differ from speedups calculated with the best sequential algorithm by no more than 10%.
The sensitivity of these parallel algorithms to communications overhead is clearly apparent when comparing the relative speedups presented in figure 7.6. Communications overhead is nearly constant across the three solvers, and the 1:2:8 relative workload of floating point operations results in relative speedups of 1:2:4 (4.5:9:18) for the BCSPWR10 power systems network. Consequently, if a speedup of 18 were required for a Choleski factorization algorithm embedded in a real-time application, one way to reach that design goal is to improve processor/communications performance by a factor of eight, causing a proportional reduction in the communications overhead. Algorithm speedup could also be achieved by increasing the floating point performance of the processor, although the ratio of computation to communications must remain equal to that on the CM-5 to obtain similar parallel speedups [17].
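A simple fixed-overhead model, with notation ($W$ for floating point workload, $C$ for communications overhead, $p$ for processor count) introduced here for illustration rather than taken from the thesis, captures this sensitivity:
\[
  S(p) \approx \frac{W}{W/p + C}.
\]
Under this model, multiplying the workload by eight at fixed $C$ yields exactly the same speedup as dividing $C$ by eight at fixed workload, since $8W/(8W/p + C) = W/(W/p + C/8)$. This is consistent with the observation that an eightfold improvement in communications performance would be needed to raise the Choleski speedup from roughly 4.5 to 18.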