
Examining Speedup

We present graphs of relative speedup calculated from empirical performance data in figure 7.6 for the three parallel direct solver implementations. The graphs for double precision and complex variate LU factorization each have two families of speedup curves, one for factorization and one for the triangular solution, covering the five power systems networks examined in this research. The graph for Choleski factorization has three families of curves, showing speedup for factorization, forward reduction, and backward substitution. Each curve plots relative speedup for 2, 4, 8, 16, and 32 processors.
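Throughout this section, relative speedup is the ratio of the run time of the parallel implementation on a single processor to its run time on p processors. In the notation used below (ours, introduced only for clarity),

\[ S_{rel}(p) \;=\; \frac{T_1}{T_p}, \qquad p \in \{2, 4, 8, 16, 32\}, \]

where T_1 is the single-processor execution time of the parallel code and T_p is its execution time on p processors.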

 
Figure 7.6: Relative Speedup --- Parallel Direct Solvers  

The graph for complex variate LU factorization in figure 7.6 illustrates that relative speedup for the complex LU factorization algorithm reaches 18 on 32 processors for the BCSPWR10 power systems network, while factorization speedup for the other data sets ranges from eight to nearly eleven. The BCSPWR10 matrix requires the most calculations, and on a single processor its empirical timing data show a greater relative increase in time from double precision LU factorization to complex LU factorization than the other power systems networks. A significant increase in the time to factor the matrix on a single processor causes a corresponding increase in speedup. We believe that the unusually good performance of the parallel solver for the BCSPWR10 data is a result of caching effects: when the program is run on one or two processors, there is too much data on each processor to fit into the fast-access cache memory. When more processors are used, there is less data per processor, the entire portion of the matrix assigned to each processor can fit concurrently into the fast-access cache, and the program runs considerably faster.
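In terms of the definition above, this sensitivity is direct: if cache misses slow the single-processor run by some factor alpha while the multi-processor run times are largely unaffected, the measured relative speedup grows by the same factor,

\[ S'_{rel}(p) \;=\; \frac{\alpha\, T_1}{T_p} \;=\; \alpha\, S_{rel}(p). \]

The factor alpha is introduced here only to make the argument explicit; it is not a quantity measured in this research.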

While there is a noticeable improvement in speedup performance for complex variate LU factorization of the BCSPWR10 matrix, complex triangular solutions for this matrix do not exhibit as significant an increase in performance over the other power systems network matrices. Complex variate triangular solutions provide speedups ranging from a low of four to a high of eight.

The graph for double precision LU factorization in figure 7.6 illustrates that relative speedup for the double precision LU factorization algorithm reaches nearly 10 on 32 processors for the BCSPWR10 power systems network. Factorization speedup falls between seven and eight for the other four networks. Double precision triangular solutions provide speedups ranging from a low of three to slightly greater than four.

Parallel Choleski factorization yields speedups that are less than those of the comparable LU algorithms. Empirical data for relative speedup varies between four and five for 32 processors, as illustrated in the graph for Choleski factorization in figure 7.6. This figure also presents empirical speedup data for forward reduction and backward substitution. Due to the significant differences in the implementation of these triangular solution algorithms, empirical data are presented for both.

Backward substitution is the simplest algorithm with the lowest communications overhead, limited to only the broadcast of recently calculated values in x when performing the triangular backward substitution on L^T. Empirical relative speedup ranges from 2.5 to a high of 3.5. These speedups are only slightly less than speedups for backward substitution associated with LU factorization. Meanwhile, essentially no speedup has been measured for the forward reduction algorithm, due primarily to greater communications overhead in this implementation than in either LU forward reduction or Choleski backward substitution. Communications are required to update the last diagonal block using data in the borders, and there are additional communications for reducing the last diagonal block. These additional communications occur because the data distribution forces interprocessor communications of partial updates when calculating values in y, rather than broadcasting values in y as in the LU-based forward reduction. Interprocessor communications increase from being proportional to the number of rows/columns in the last diagonal block to being proportional to the number of non-zeros in the last diagonal block. After minimum degree ordering of the last diagonal block, the number of non-zeros may be significantly greater than the number of rows or columns. Due to the characteristics of Choleski factorization, it is inevitable that either forward reduction or backward substitution would have to deal with the problem of increased interprocessor communications [29].
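The contrast between the two communication patterns can be illustrated with a small counting sketch. The Python script below is only illustrative: the random sparsity pattern, the column-wrapped ownership rule, and the helper names are assumptions made for this example, not the data structures of the parallel solver. It tallies one broadcast per computed solution entry for the fan-out style backward substitution, versus one partial-update message per non-zero whose owner differs from the owner of the target row for the fan-in style Choleski forward reduction, showing why the latter scales with the number of non-zeros in the last diagonal block.

# Illustrative message counting for the last diagonal block.
# The pattern, ownership rule, and sizes are hypothetical.
import random

def fan_out_messages(n):
    # Backward substitution: each newly computed x[i] is broadcast once,
    # so interprocessor communications are proportional to the number
    # of rows/columns n in the last diagonal block.
    return n

def fan_in_messages(pattern, owner):
    # Choleski forward reduction: every non-zero L[i][j] held by a
    # processor other than the one computing y[i] produces a partial
    # update that must be communicated, so messages grow with the
    # number of off-processor non-zeros.
    return sum(1 for (i, j) in pattern if owner(j) != owner(i))

def owner(k, p=32):
    # Column-wrapped ownership over p processors (an assumption).
    return k % p

if __name__ == "__main__":
    random.seed(0)
    n, density = 200, 0.10
    # Random strictly lower-triangular pattern standing in for the
    # last diagonal block after minimum degree ordering and fillin.
    pattern = [(i, j) for i in range(n) for j in range(i)
               if random.random() < density]
    print("fan-out (broadcast x[i]) messages :", fan_out_messages(n))
    print("fan-in (partial updates) messages :", fan_in_messages(pattern, owner))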

The analysis in this thesis section has used relative speedup, where the sequential execution time was measured with a version of the parallel algorithm running on a single processor. The block-diagonal-bordered form matrices have additional fillin and additional calculations when compared to a matrix that has been ordered as a single large matrix. Tables gif through gif (appendix gif) present summary statistics for the power systems networks used in this analysis and show that the block-diagonal-bordered matrices have between 15% and 20% additional non-zeros due to fillin when compared to general minimum degree ordered matrices.

Differences in run times would be expected to be a greater percentage than the differences in non-zeros, because the algorithmic complexity of factorization grows faster than linearly in the number of non-zeros. However, when preliminary tests were run with the NiMo-OPS matrix on single processor Sun Microsystems SPARCstations, the difference in performance ranged between 5% and 10%, depending on processor type. The general sparse sequential direct solver was a fan-out algorithm that requires access to memory locations throughout the remaining matrix as rank-1 updates are performed. We hypothesize that the general sparse sequential algorithm performed worse than expected due to cache access difficulties: while the general sparse algorithm requires access to the entire matrix as updates are performed, the sequential block-diagonal-bordered direct solver performs operations on limited portions of the matrix at any instance.
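A brief way to state the expectation (in our notation, not a result from the thesis): if the sequential factorization work scales as

\[ W \;\propto\; \eta^{\gamma}, \qquad \gamma > 1, \]

in the number of non-zeros eta, then 15% to 20% additional non-zeros should cost at least 15% to 20% additional sequential run time. The measured 5% to 10% difference is smaller than fillin alone would predict, consistent with the cache-access hypothesis above.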

Due to the limited difference in performance between the sparse block-diagonal-bordered solver and the general sparse fan-out solver, sequential timings were taken from single processor runs with the parallel solver. Relative speedups reported here should differ from speedups calculated with the best sequential algorithm by no more than 10%.

The sensitivity of these parallel algorithms to communications overhead is clearly apparent when comparing the relative speedups presented in figure 7.6. Communications overhead is nearly constant, and the 1:2:8 relative workload of floating point operations results in relative speedups of 1:2:4 (4.5:9:18) for the BCSPWR10 power systems network. Consequently, if speedups of 18 were required for a Choleski factorization algorithm embedded in a real-time application, one way to reach that design goal is to improve processor/communications performance by a factor of eight, causing proportional reductions in the communications overhead. Another way to increase algorithm speedup is to improve the floating point performance of the processor, although the ratio of computation to communications must stay equal to that of the CM-5 to obtain similar parallel speedups [17].
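A simple fixed-overhead model (our construction, not one from the thesis) makes this relationship concrete. If

\[ T_p \;\approx\; \frac{W}{p\,r} + C, \qquad S_{rel}(p) \;\approx\; \frac{W/r}{W/(p\,r) + C}, \]

where W is the floating point workload, r is the per-processor computation rate, and C is the nearly constant communications overhead, then scaling W by factors of 2 and 8 while holding C fixed yields speedups that grow more slowly than the workload, roughly reproducing the observed 4.5:9:18 progression. Conversely, if C is chosen so that the model gives a speedup of 4.5 on 32 processors, cutting C by a factor of eight raises the predicted speedup to approximately 18, which is the design alternative described above.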





