
Examining Speedup

Definition --- Relative Speedup: Given a single problem with a sequential algorithm running on one processor and a concurrent algorithm running on p independent processors, relative speedup is defined as

\[ S_p = \frac{T_1}{T_p} \]

where $T_1$ is the time to run the sequential algorithm as a single process and $T_p$ is the time to run the concurrent algorithm on p processors.
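For example, the definition can be applied directly to a table of measured run times. The sketch below computes relative speedup and the corresponding parallel efficiency; the timing values are placeholders, not measurements from this research.

# Illustrative sketch: relative speedup S_p = T_1 / T_p from timing data.
# The run times below are hypothetical placeholders, not measured CM-5 data.
sequential_time = 10.0                       # T_1: one-processor run time (seconds)
parallel_times = {2: 5.4, 4: 2.9, 8: 1.6,    # T_p: run time on p processors
                  16: 0.95, 32: 0.62}

for p, t_p in sorted(parallel_times.items()):
    speedup = sequential_time / t_p          # relative speedup S_p
    efficiency = speedup / p                 # fraction of ideal (linear) speedup
    print(f"p={p:2d}  S_p={speedup:5.2f}  efficiency={efficiency:4.2f}")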

Graphs of relative speedup calculated from the empirical performance data are provided in figures 48 through 50 for the three parallel direct solver implementations. Figures 48 and 49 each have two families of speedup curves, one for factorization and one for the triangular solution, covering the five power systems networks examined in this research. Figure 50 has three families of curves, for factorization, forward reduction, and backward substitution, for the same five networks. Each curve plots relative speedup for 2, 4, 8, 16, and 32 processors.

 
Figure 48: Relative Speedup --- Complex Variate LU Factorization  

 
Figure 49: Relative Speedup --- Double Precision LU Factorization  

 
Figure 50: Relative Speedup --- Choleski Factorization  

Figure 48 illustrates that the relative speedup of the complex LU factorization algorithm can be as much as eighteen on 32 processors for the BCSPWR10 power systems network, while factorization speedups for the other data sets range from nearly eight to eleven. The BCSPWR10 matrix requires the most calculations, and careful examination of figures 43 through 47 shows that the empirical timing data for the BCSPWR10 matrix exhibit a greater relative increase in single-processor time from double precision LU factorization to complex LU factorization than the other power systems networks. A significant increase in the time to factor the matrix on a single processor causes a corresponding increase in speedup. We believe that the unusually good performance of the parallel solver for the BCSPWR10 data is a result of caching effects: when the program is run on one or two processors, there is too much data on each processor to fit into the fast memory cache, but when more processors are used, the entire data set can fit concurrently into the processors' fast caches, and the program runs considerably faster with nearly all cache hits. Meanwhile, complex triangular solutions provide speedups ranging between a high of eight and a low of four.
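The caching argument amounts to a capacity calculation: as processors are added, the matrix data assigned to each processor shrinks until it fits within the fast cache. The sketch below works through such an estimate; the cache size, storage cost per non-zero, and non-zero count are assumed values for illustration, not parameters measured for the CM-5 runs reported here.

# Rough cache-capacity estimate behind the BCSPWR10 caching argument.
# All sizes are illustrative assumptions, not measured CM-5 parameters.
cache_bytes_per_processor = 128 * 1024   # assumed fast-cache capacity per node
bytes_per_nonzero = 16                   # complex double: 2 * 8 bytes
nonzeros_in_matrix = 60_000              # assumed non-zero count after fill-in

for p in (1, 2, 4, 8, 16, 32):
    per_processor_bytes = nonzeros_in_matrix * bytes_per_nonzero / p
    fits = per_processor_bytes <= cache_bytes_per_processor
    print(f"p={p:2d}  data/processor={per_processor_bytes/1024:7.1f} KB  "
          f"cache-resident={fits}")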

Figure 49 illustrates that parallel performance of the double precision LU factorization algorithm can be nearly ten on 32 processors for the BCSPWR10 power systems network. Factorization performance falls between seven and eight for the other four networks. Likewise, double precision triangular solutions provide speedups ranging from slightly greater than four to a low of three.

Parallel Choleski factorization yields speedups that are less than those of the comparable LU algorithms. Empirical relative speedups vary between four and five on 32 processors, as illustrated in figure 50. This figure also presents empirical speedup data for forward reduction and backward substitution; due to the significant differences in the implementation of these triangular solution algorithms, empirical data are presented for both.

Backward substitution is the simplest algorithm and has the lowest communications overhead, limited to broadcasting recently calculated values of the solution vector as the triangular backward substitution proceeds. Empirical relative speedup ranges from a high of 3.5 to a low of 2.5; these speedups are only slightly less than the speedups for backward substitution associated with LU factorization. Meanwhile, essentially no speedup has been measured for the forward reduction algorithm, due primarily to greater communications overhead in this implementation than in either LU forward reduction or Choleski backward substitution. Additional communications are required to update the last diagonal block using data in the borders, and there are additional communications when reducing the last diagonal block. These additional communications occur because the data distribution forces interprocessor communication of partial updates when calculating values in the intermediate solution vector, rather than the broadcast of previously calculated values used in the LU-based forward reduction. Interprocessor communications therefore increase from being proportional to the number of rows/columns in the last diagonal block to being proportional to the number of non-zeros in the last diagonal block, and after minimum degree ordering the last diagonal block contains considerably more non-zeros than rows. Due to the characteristics of Choleski factorization, it is inevitable that either forward reduction or backward substitution must deal with this increase in interprocessor communications [20].
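The proportionality argument can be illustrated with a small sketch. The function names, matrix sizes, and non-zero counts below are hypothetical, introduced only to contrast the two communication patterns; this is not the message-passing code used in the CM-5 implementation.

# Illustrative comparison of interprocessor message volume for the two
# forward-reduction schemes discussed above.  The sizes and non-zero counts
# are hypothetical; only the proportionality is the point.
def broadcast_scheme_messages(n_rows, processors):
    # LU-style forward reduction: each newly computed solution value is
    # broadcast once, so traffic grows with the order of the last diagonal block.
    return n_rows * (processors - 1)

def partial_update_scheme_messages(nonzeros):
    # Column-oriented Choleski forward reduction: a partial update is sent for
    # (roughly) every off-diagonal non-zero, so traffic grows with the number
    # of non-zeros in the last diagonal block.
    return nonzeros  # upper bound: one message per off-diagonal non-zero

for n_rows, nonzeros in [(200, 1500), (400, 4200), (800, 11000)]:
    print(f"n={n_rows:4d}  nnz={nonzeros:6d}  "
          f"broadcast~{broadcast_scheme_messages(n_rows, 32):6d}  "
          f"partial-update~{partial_update_scheme_messages(nonzeros):6d}")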

The sensitivities of these parallel algorithms to communications overhead are clearly apparent when comparing the relative speedups presented in figures 48, 49, and 50. Across the three solver implementations, the amount of floating point calculation increases from double precision Choleski factorization to double precision LU factorization to complex LU factorization, with a relative workload of approximately 6:2:1 (complex LU to double precision LU to Choleski); however, the amounts of communications are nearly equal. Communications in block-diagonal-bordered Choleski or LU factorization occur in two places: updating the last diagonal block using data in the borders, and factoring the last diagonal block. When updating the last diagonal block, there are twice as many calculations and twice as many values to distribute to processors holding data in the last diagonal block for LU factorization as for Choleski factorization, because the LU update requires data from both the lower and upper borders, whereas the Choleski update requires data from only the lower border. The amounts of communications for LU and Choleski factorization are equal when factoring the last diagonal block.

When factoring the last diagonal block, the Choleski algorithm requires that recently factored data be broadcast to all processors in the pipelined algorithm that perform the rank 1 update of the submatrix, and the parallel LU algorithm for the last diagonal block requires a broadcast of similar volume during its parallel rank 1 update. Because communications overhead is nearly constant across the three solvers, the approximately 6:2:1 relative workload of floating point operations results in relative speedups of roughly 4:2:1 for the BCSPWR10 power systems network. Consequently, if a speedup of 18 were required for a Choleski factorization algorithm embedded in a real-time application, one way to reach that design goal would be to improve processor/communications performance by a factor of six, causing a proportional reduction in communications overhead. Another way to increase algorithm speedup would be to increase the floating point performance of the processor, although the communications-to-calculations performance ratio would have to remain equal to that of the CM-5 used for these test runs.
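A simple fixed-overhead model helps explain why a roughly 6:2:1 workload with nearly constant communications produces roughly 4:2:1 speedups: the heavier workloads amortize proportionally more of the fixed communications cost. The sketch below evaluates such a model; the workload and overhead constants are illustrative assumptions and are not fitted to the CM-5 measurements.

# Toy speedup model: T_p = W / p + C, where W is the floating point workload,
# p the processor count, and C a communications overhead that is (nearly) the
# same for all three solvers.  W and C values are illustrative assumptions.
def modeled_speedup(workload, processors, comm_overhead):
    sequential_time = workload                       # T_1 = W (no communication)
    parallel_time = workload / processors + comm_overhead
    return sequential_time / parallel_time

comm_overhead = 0.2                                  # same overhead for all solvers
workloads = {"Choleski": 1.0,                        # relative workloads ~ 1:2:6
             "double precision LU": 2.0,
             "complex LU": 6.0}

for name, w in workloads.items():
    print(f"{name:22s}  modeled S_32 = {modeled_speedup(w, 32, comm_overhead):5.2f}")

With these particular constants, the modeled 32-processor speedups come out near 4, 8, and 15, reproducing the qualitative pattern in figures 48 through 50.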


