The three solver implementations require increasing amounts of floating point calculations: double precision Choleski factorization, double precision LU factorization, and complex LU factorization have relative workloads of approximately 1:2:8, with computational granularity proportional to those workloads. While the amounts of calculations in these algorithms differ, the amounts of communications are nearly equal. We present sample timing comparisons for the three solver implementations in figure 7.5 for two power systems networks: BCSPWR09 and BCSPWR10. These graphs each have six curves --- three for factorization and three for the triangular solution. They illustrate the sensitivity that parallel sparse direct solvers for power systems networks exhibit to the amount of communications overhead. This sensitivity is not entirely unexpected, given the extremely sparse nature of power systems network matrices. Performance is similar for the other sample power systems network matrices examined in this research.
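The approximate 1:2:8 ratio can be motivated with a rough operation count (a sketch based on standard dense leading-order factorization costs, not on the measured data): Choleski factorization computes only the lower triangular factor, LU factorization computes both triangular factors, and each complex multiply-add pair costs roughly four real multiply-add pairs, so

\[
\frac{n^3}{3} \;:\; \frac{2n^3}{3} \;:\; 4 \cdot \frac{2n^3}{3} \;\approx\; 1 : 2 : 8 .
\]

For sparse matrices the absolute counts are far smaller, but the same per-operation ratios carry over.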
Figure 7.5: Parallel Choleski and LU Timing Comparisons
These graphs plot the time in milliseconds required to factor the matrix or to calculate a triangular solution. The single processor timings also reflect the relative numbers of floating point operations in the three implementations. When comparing the empirical parallel performance data, the ratios of the factorization times for the three implementations decrease as additional processors are utilized. Because the amounts of communications are nearly constant, this overhead has proportionally less effect when the LU factorization algorithms perform more calculations, and it is also easier to hide communications behind calculations when more floating point operations are being performed.
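Hiding communications behind calculations typically relies on nonblocking message passing: a transfer is started, floating point work proceeds while the message is in flight, and completion is enforced only when the buffer must be reused. The following sketch uses generic MPI primitives and hypothetical names; it illustrates the overlap idea only, and is not the communication mechanism used in these implementations.

    #include <mpi.h>

    /* Illustrative only: the send of a factored pivot column is started,
     * a local elimination update proceeds while the message is in flight,
     * and the wait occurs only when the buffer must be reused.  The
     * function and parameter names are hypothetical. */
    void update_with_overlap(double *pivot_col, int len, int dest,
                             double multiplier, double *target_col)
    {
        MPI_Request req;

        /* Start the transfer without waiting for it to complete. */
        MPI_Isend(pivot_col, len, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

        /* Local floating point work overlaps the communication; the more
         * calculations performed here, the better the send is hidden. */
        for (int i = 0; i < len; i++)
            target_col[i] -= multiplier * pivot_col[i];

        /* Ensure the send has completed before the buffer is reused. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

With only a single column update between the start of the transfer and the wait, little is hidden; the benefit grows with the amount of floating point work available to overlap the message.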
These graphs also illustrate some important facts concerning parallel
triangular solutions for Choleski factorization. First, while Choleski
factorization requires only half the calculations of LU factorization,
there is no reduction in the number of calculations in the triangular
solution step --- both forward reduction and backward substitution
steps must still be performed. To calculate the triangular solution,
every non-zero coefficient in L is used once during forward reduction,
and every non-zero coefficient in L^T must also be used once during
backward substitution. While it is possible to avoid explicitly
forming the matrix transpose, one of the triangular solutions will
then require additional communications overhead because the data will
be oriented inconveniently. The additional communications in this
solution phase are proportional to the number of non-zeros in the last
diagonal block, whereas the communications for LU factorization-based
forward reduction are proportional only to the number of rows or
columns. The number of non-zeros is greater than the number of rows
or columns in the matrix, especially after fillin is taken into
account. For a single processor, solving the factored equations
requires the same amount of work for double precision LU and Choleski.
However, the additional communications overhead has a noticeable
effect on the slope of the curve representing the triangular solution
for the Choleski solver. We observed this phenomenon for all five
power systems networks.
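The asymmetric data orientation of the two sweeps can be seen in a serial sketch (a simplified illustration, assuming L is stored by rows in compressed sparse row form with the diagonal last in each row; the distributed data structures used in these implementations are not reproduced here). Forward reduction uses each row in its natural gather orientation, while backward substitution with L^T reuses the same rows in a scatter orientation; the scattered updates are the inconveniently oriented data that generate the extra communications in a parallel setting.

    #include <stdio.h>

    /* Forward reduction: solve L y = b, overwriting b with y.
     * Each stored non-zero of L is touched exactly once, row by row. */
    void forward_reduce(int n, const int *rowptr, const int *col,
                        const double *val, double *b)
    {
        for (int i = 0; i < n; i++) {
            double sum = b[i];
            int k;
            for (k = rowptr[i]; k < rowptr[i + 1] - 1; k++)
                sum -= val[k] * b[col[k]];
            b[i] = sum / val[k];              /* diagonal entry L(i,i) */
        }
    }

    /* Backward substitution: solve L^T x = y, overwriting y with x,
     * without forming the transpose.  Once x(i) is known, its
     * contribution is pushed (scattered) into every earlier unknown. */
    void backward_substitute(int n, const int *rowptr, const int *col,
                             const double *val, double *y)
    {
        for (int i = n - 1; i >= 0; i--) {
            int diag = rowptr[i + 1] - 1;
            y[i] /= val[diag];                /* x(i) = y(i) / L(i,i) */
            for (int k = rowptr[i]; k < diag; k++)
                y[col[k]] -= val[k] * y[i];
        }
    }

    int main(void)
    {
        /* 3-by-3 example: L = [2 0 0; 1 3 0; 0 1 4], solve L L^T x = b. */
        int    rowptr[] = {0, 1, 3, 5};
        int    col[]    = {0, 0, 1, 1, 2};
        double val[]    = {2.0, 1.0, 3.0, 1.0, 4.0};
        double b[]      = {2.0, 5.0, 9.0};

        forward_reduce(3, rowptr, col, val, b);
        backward_substitute(3, rowptr, col, val, b);
        printf("x = %g %g %g\n", b[0], b[1], b[2]);
        return 0;
    }

In serial, both sweeps perform the same number of floating point operations, consistent with the single processor observation above; the difference appears only when the scattered updates of the second sweep must cross processor boundaries.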