
Timing Performance Comparisons

The three solver implementations require increasing amounts of floating point calculations: double precision Choleski factorization, double precision LU factorization, and complex LU factorization have relative workloads of approximately 1:2:8, and the granularity of each implementation scales in the same proportion. While the amounts of calculation differ among these algorithms, the amounts of communications are nearly equal. We present sample timing comparisons for the three solver implementations in figure 7.5 for two power systems networks: BCSPWR09 and BCSPWR10. Each graph has six curves: three for factorization and three for the triangular solution. These graphs illustrate how sensitive parallel sparse direct solvers for power systems networks are to communications overhead. This sensitivity is not unexpected, given the extremely sparse nature of power systems network matrices. Performance is similar for the other sample power systems network matrices examined in this research.
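A rough operation count, not derived from the measured timings, makes the approximate 1:2:8 ratio plausible. Choleski factorization computes only the lower triangular factor, so it requires about half the floating point operations of real LU factorization, and each complex multiply-add costs roughly four real multiply-adds, so complex LU performs about four times the work of real LU:

\[
W_{Choleski} : W_{LU} : W_{complex\,LU}
  \;\approx\; \tfrac{1}{2} : 1 : 4
  \;=\; 1 : 2 : 8 .
\]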

Figure 7.5: Parallel Choleski and LU Timing Comparisons

These graphs plot the time in milliseconds required to factor the matrices or to calculate a triangular solution. The single-processor timings also reflect the relative numbers of floating point operations in the three implementations. When comparing the empirical parallel performance data, the ratios of the factorization times for the three implementations decrease as additional processors are utilized. Because the amounts of communications are nearly constant, this overhead has proportionally less effect when the LU factorization algorithms perform more calculations, and it is also easier to hide communications behind calculations when more floating point operations are being performed.
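An idealized timing model, stated here only to make the trend plausible (it ignores load imbalance and any overlap of communications with calculations), captures why the ratios shrink. If W denotes the double precision Choleski workload for a given matrix, p the number of processors, and C(p) the nearly constant communications cost, then

\[
T_{Choleski}(p) \;\approx\; \frac{W}{p} + C(p),
\qquad
\frac{T_{complex\,LU}(p)}{T_{Choleski}(p)}
  \;\approx\; \frac{8W/p + C(p)}{W/p + C(p)} ,
\]

which equals the workload ratio of eight when communications are negligible and falls toward one as C(p) grows relative to W/p, that is, as additional processors are applied to these very sparse matrices.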

These graphs also illustrate some important facts concerning parallel triangular solutions based on Choleski factorization. First, while Choleski factorization requires only half as many calculations as LU factorization, there is no corresponding reduction in the triangular solution step: both forward reduction and backward substitution must still be performed. To calculate the triangular solution, every non-zero coefficient in L is used once during forward reduction, and every non-zero coefficient in L^T must also be used once during backward substitution. While it is possible to avoid explicitly forming the matrix transpose, one of the triangular solutions will then require additional communications overhead because the data will be oriented inconveniently. This solution phase incurs additional communications proportional to the number of non-zeros in the last diagonal block, whereas the communications for LU factorization-based forward reduction are proportional to the number of rows or columns. The number of non-zeros is greater than the number of rows or columns in the matrix, especially after accounting for fillin. On a single processor, solving the factored equations requires the same amount of work for double precision LU and Choleski. However, the additional communications overhead has a noticeable effect on the slope of the curve representing the triangular solution for the Choleski solvers. We observed this phenomenon for all five power systems networks.
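The data-orientation issue can be seen even in a serial sketch of the two sweeps. The C fragment below assumes L is held in compressed sparse column form with the diagonal entry stored first in each column; the identifiers (colptr, rowind, val) are illustrative and are not taken from the solver implementations discussed in this chapter.

#include <stddef.h>

/* Forward reduction: solve L*x = b, with x initialized to a copy of b.
 * Column-oriented: each column j SCATTERS updates to rows below j. */
static void forward_reduce(size_t n, const size_t *colptr,
                           const size_t *rowind, const double *val, double *x)
{
    for (size_t j = 0; j < n; j++) {
        x[j] /= val[colptr[j]];                   /* divide by diagonal L(j,j) */
        for (size_t p = colptr[j] + 1; p < colptr[j + 1]; p++)
            x[rowind[p]] -= val[p] * x[j];        /* update entries i > j */
    }
}

/* Backward substitution: solve L^T*x = y, with x initialized to a copy of y.
 * The same arrays are reused: column j of L is row j of L^T, so this sweep
 * GATHERS previously computed entries instead of scattering. */
static void backward_substitute(size_t n, const size_t *colptr,
                                const size_t *rowind, const double *val, double *x)
{
    for (size_t j = n; j-- > 0; ) {
        for (size_t p = colptr[j] + 1; p < colptr[j + 1]; p++)
            x[j] -= val[p] * x[rowind[p]];        /* dot product with x(i), i > j */
        x[j] /= val[colptr[j]];                   /* divide by diagonal L(j,j) */
    }
}

On a single processor the scatter and gather sweeps cost the same number of operations, but once L is partitioned across processors, whichever sweep runs against the storage orientation must fetch remotely held entries, which is the additional communications overhead described above.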





